diff --git a/vol-1/part-1-thesis-and-pain/ch01-when-saas-fights-reality.md b/vol-1/part-1-thesis-and-pain/ch01-when-saas-fights-reality.md
index 8cc33e9..6aee1cd 100644
--- a/vol-1/part-1-thesis-and-pain/ch01-when-saas-fights-reality.md
+++ b/vol-1/part-1-thesis-and-pain/ch01-when-saas-fights-reality.md
@@ -1,34 +1,32 @@
 # Chapter 1 - When SaaS Fights Reality
-
+
 ---
-It's two in the afternoon in Pune, and Sunita Kulkarni, the project manager on a $4.2 million hospital-expansion bid, is staring at a browser tab that refuses to load. Her firm's general-contractor bid is due at five, and the owner group is scheduled to meet at six. The project management platform her firm operates on has been down since eleven that morning.
+It's two in the afternoon in Pune, and Sunita Kulkarni, the project manager on a $4.2 million hospital-expansion bid, is staring at a browser tab that refuses to load. The bid is due at five. The platform has been down since eleven.
-The data isn't lost; it exists somewhere-on servers in Virginia, Oregon, or any other cloud region that happens to be active that day. The labor breakdown, subcontractor bids, change order history, and payment schedule-all of it remains intact on a hard drive Sunita will never access, in a building she couldn't find on a map. It's simply inaccessible. The vendor's status page claims it's an outage affecting less than 1% of users. On this bid, that 1% is everyone.
+The data isn't lost. It exists on servers in Virginia or Oregon — intact, on a hard drive Sunita will never access, in a building she couldn't find on a map. It's simply inaccessible. The vendor's status page calls it an outage affecting less than 1% of users. On this bid, that 1% is everyone.
-As the clock ticks down, Sunita's options dwindle. She can only reconstruct what she can from an email trail, export a stale PDF from before the platform went down, or ask her client to extend the deadline. 
But that would require explaining to the board what happened and why the firm wasn't prepared.
+This isn't a planning failure. Sunita planned correctly; her team had used the software. The failure is structural: her data resides on infrastructure she doesn't control, and when that infrastructure goes offline, her capabilities go with it.
-This isn't a planning failure. Sunita planned correctly, her team had used the software. Everything was in order. The failure is structural: her data resides on infrastructure she doesn't control, and when that infrastructure goes offline, her capabilities are compromised.
-
-This scenario repeats across various industries that rely on deadline-sensitive work-the attorney preparing a brief at nine in the evening, the engineer updating safety documentation in the field, and the physician accessing patient records before rounds. The infrastructure fails identically, but only the deadlines change.
+This scenario repeats wherever deadline-sensitive work runs on cloud infrastructure — the attorney drafting a brief at nine in the evening, the engineer updating safety documentation in the field, the physician accessing records before rounds. The infrastructure fails identically. Only the deadlines change.
 ---
 ## The Bundle Nobody Agreed To
-The SaaS (Software as a Service) deal goes like this. Give us your data. Keep it on our servers. Pay us every month. In exchange you get real-time collaboration, multi-device access, and zero maintenance. Most users said yes without fully registering the second half. The first half was the product. The second half was the terms.
+The SaaS deal goes like this: give us your data, keep it on our servers, pay us every month. In exchange you get real-time collaboration, multi-device access, and zero maintenance. Most users said yes without fully registering the second half. The first half was the product. The second half was the terms.
-The three desirable properties are real. 
Real-time collaboration is transformative - two people editing the same document, watching each other's changes appear, never again emailing attachments back and forth. Multi-device access means your work is on your phone when you need it at the airport. Zero maintenance means IT does not nurse a server in a closet; the vendor handles it.
+The three desirable properties are real. Real-time collaboration is transformative. Multi-device access means your work follows you. Zero maintenance means IT doesn't nurse a server in a closet.
-The three conditions on the other side of the bundle get less attention. Your data lives on vendor infrastructure, which means the vendor can see it, lose it, sell the company that holds it, or turn the service off. Pricing is at the vendor's discretion - the rate when you adopted the software is not a commitment. It is a starting point. Service continuity is contingent on the vendor's survival: if the company gets acquired, runs out of money, or decides to sunset the product, your software stops working when theirs does.
+The three conditions on the other side get less attention. Your data lives on vendor infrastructure, which means the vendor can see it, lose it, sell the company that holds it, or shut the service off. Pricing is at the vendor's discretion — the rate at adoption is a starting point, not a commitment. Service continuity is contingent on the vendor's survival.
-The acceptance was rational. Neither half of the bundle is fully visible at adoption time. The terms of service when a company signs up and the terms of service three acquisitions later are different documents. The pricing that wins a customer's business is designed to win it - not to represent what the platform costs after that customer has built their workflows, trained their staff, and transferred their data. The bundle reveals itself slowly, after the switching costs have accumulated. 
+The acceptance was rational, because the second half wasn't visible at adoption time. The pricing that wins a customer's business isn't calibrated to represent what the platform costs after that customer has built workflows, trained staff, and transferred data. The bundle reveals itself slowly, after switching costs have accumulated.
-Users accepted these conditions because the three desirable properties appeared to *require* them. Real-time collaboration required a central server both parties could talk to. Multi-device sync required a cloud that acted as the authoritative copy. Zero maintenance required that the vendor control the infrastructure. The package looked indivisible because, with the technology of 2010, it largely was.
+Users accepted these conditions because the three desirable properties appeared to *require* them. Real-time collaboration required a central server. Multi-device sync required a cloud acting as the authoritative copy. Zero maintenance required that the vendor control the infrastructure. The package looked indivisible because, with the technology of 2010, it largely was.
 That is no longer true.
@@ -38,203 +36,163 @@ That is no longer true.
 ### The Outage That Takes Your Work With It
-Major SaaS providers report 99.9% uptime - roughly 8.7 hours of downtime per year. For a single user, those hours scatter harmlessly across the calendar and rarely land at a bad moment. For a team of ten, at any given moment somebody is in the middle of something time-sensitive.
-
-Sunita Kulkarni's 8.7 hours found her at 4:47 in the afternoon, with thirteen minutes left to submit a subcontractor bid for a hospital expansion in Pune. The platform - the SaaS construction-management product her firm had standardized on the year before - had been slow all afternoon. Pages took six seconds to load instead of one. 
Sunita had opened the bid spreadsheet in three browser tabs that morning because she did not trust the network, and she switched between them as one slowed and another caught up. She had been carrying the bid for six weeks. Two hundred and forty-three line items. Subcontractor quotes, materials, equipment, contingency. The kind of document a construction PM keeps cleaner than her own desk.
-
-At 4:47 the platform stopped responding. She refreshed. Spinning indicator. She refreshed. Spinning indicator. She called her counterpart at the firm who was supposed to countersign the bid; her counterpart could not reach the platform either. Sunita tried to email the spreadsheet to the client directly - the platform's single sign-on tied her email account to the same provider, and her email was locked too. By 5:04 she had her phone in her hand watching the timestamp move past the deadline. She did not say anything when the window closed. She set the phone face-down on the desk and listened to the office around her - keyboards, voices, somebody laughing about something - and she counted the line items she had not been able to submit. Two hundred and forty-three. The bid was won by a competitor whose construction-management platform happened to run on a different vendor whose dependencies had not gone down at 4:47 that afternoon.
+Major SaaS providers report 99.9% uptime — roughly 8.7 hours of downtime per year. For a single user, those hours scatter harmlessly across the calendar. For a team of ten, at any given moment somebody is in the middle of something time-sensitive.
-Sunita kept three tabs open after that. She still keeps three tabs open. The tic is what she carries from the afternoon she lost the Pune hospital bid. The architecture is what eventually replaces the tic.
+Sunita Kulkarni's 8.7 hours found her at 4:47 in the afternoon with thirteen minutes left to submit a subcontractor bid for the Pune hospital expansion. The platform had been slow all afternoon. 
At 4:47 it stopped responding entirely. She refreshed. Spinning indicator. She called her counterpart who was supposed to countersign; her counterpart couldn't reach the platform either. The platform's single sign-on tied her email to the same provider — her email was locked too. At 5:04 she watched the timestamp move past the deadline. The bid was won by a competitor whose construction-management platform ran on a different vendor whose dependencies hadn't gone down at 4:47.
-The outage that gets published is the one the vendor is willing to call an outage. The incidents that affect partial regions, specific features, or specific customer cohorts surface as "degraded performance" - a phrase that does most of its work by not being the word *outage*. From the affected user's side, degraded performance means the site loads but submissions fail silently, changes save and then revert, or search returns stale results. This is harder to work around than a clean outage, because it is not obvious that the problem is the platform rather than something the user did. With a clean outage you know to stop trying. With degraded performance you keep trying - and the failure looks like something you did.
+Sunita kept three tabs open after that. The tic is what she carries from the afternoon she lost the Pune hospital bid. The architecture is what eventually replaces the tic.
-What makes outage risk asymmetric is that it falls hardest on the moments that matter most. High-stakes work - deadline submissions, live customer sessions, critical handoffs - tends to involve intensive platform use, which means it is more exposed to performance degradation under load. And the work that can least tolerate delay tends to be the work with external dependencies: bids due to clients, documents due to regulators, reports due to boards. These are not moments where "try again in an hour" is an option.
+The outage the vendor publishes is the one it's willing to call an outage. 
Incidents affecting partial regions, specific features, or specific customer cohorts surface as "degraded performance" — a phrase that does most of its work by not being the word *outage*. With a clean outage you know to stop trying. With degraded performance you keep trying, and the failure looks like something you did.
-Sunita's afternoon is not unusual for her industry. Construction project management is deadline-driven by definition. A subcontractor bid has a submission deadline that is not negotiable after the fact. A change order authorization has a response window tied to contract terms. A safety inspection log has a regulatory timestamp requirement. When any of these processes depends on cloud infrastructure being available exactly when needed, the infrastructure becomes a single point of failure in a workflow that cannot tolerate one.
+Outage risk falls hardest on the moments that matter most. High-stakes work — deadline submissions, live customer sessions, critical handoffs — involves intensive platform use, which means it's more exposed to performance degradation under load. The work that can least tolerate delay tends to be the work with external dependencies: bids due to clients, documents due to regulators, reports due to boards. These are not moments where "try again in an hour" is an option.
-Availability statistics miss a compounding factor. The concentration of cloud hosting means failures cascade across unrelated products at the same instant. The December 2021 AWS us-east-1 outage affected every product hosted there - project management tools, document collaboration platforms, file storage services, communication tools - at the same moment. A single incident becomes an industry-wide incident for everyone whose vendor chose the same region. Users who experience a simultaneous failure across multiple tools they rely on do not find redundancy in having adopted multiple platforms; they find that all their fallback options went down at the same time. 
This is the dependency chain. Not your vendor failing, but the infrastructure layer beneath your vendor - shared cloud regions, CDN providers, authentication services - none of which appear in your vendor's SLA (Service Level Agreement), and none of which you have any contract with.
+The concentration of cloud hosting compounds this. The December 2021 AWS us-east-1 outage hit every product hosted there simultaneously — project management tools, document platforms, file storage, communication tools. Users who had adopted multiple platforms found that all their fallback options went down at the same time. Their vendor SLAs (Service Level Agreements) say nothing about the infrastructure layer beneath their vendor — shared cloud regions, CDN providers, authentication services — none of which the user has any contract with.
-Outages hit hardest the users who can least work around them. Assistive technology users - those who rely on screen readers, switch access devices, or voice control software - experience SaaS connectivity failure as complete access failure. The screen reader announces a failed load. Voice control has no form fields to target. The application stops responding. Degraded performance that a connected user circumvents by refreshing is inaccessible in a more absolute sense - the AT user cannot navigate what is not there. The architecture this dissertation proposes keeps the application responsive regardless of network state. For AT users, this is not a usability improvement. It is the difference between accessible and inaccessible software.
+Outages hit hardest the users who can least work around them. Assistive technology users — those who rely on screen readers, switch access devices, or voice control — experience SaaS connectivity failure as complete access failure. 
Degraded performance that a sighted user circumvents by refreshing is inaccessible in a more absolute sense: the screen reader announces a failed load; voice control has no form fields to target. The architecture developed in later chapters keeps the application responsive regardless of network state. For AT users, this is not a usability improvement. It is the difference between accessible and inaccessible software.
 ### The Vendor That Disappears
-In 2015, Sunrise Calendar had a substantial mobile user base (estimated by industry coverage in the low millions) and was widely considered the best third-party calendar app for iOS. Microsoft acquired it that year. Microsoft shut it down in August 2016. Users received a few weeks' notice. The data was exportable - in a format that no other calendar app read natively, requiring manual remapping of categories and recurrence rules.
+In 2015, Sunrise Calendar had a substantial mobile user base and was widely considered the best third-party calendar app for iOS. Microsoft acquired it that year and shut it down in August 2016. Users received a few weeks' notice. The data was exportable in a format no other calendar app read natively.
 Sunrise was not exceptional. It was typical of how software products end.
-The mechanism changes - acquisition, runway exhaustion, a strategic pivot, the founder taking a job somewhere larger - but the pattern is consistent. The product goes dark. Users who built their workflows around it are left with whatever they managed to export before the deadline.
+The mechanism changes — acquisition, runway exhaustion, a strategic pivot, the founder taking a job somewhere larger — but the pattern is consistent. The product goes dark. Users who built workflows around it are left with whatever they managed to export before the deadline. Salesforce acquired Quip and deprioritized it; teams that had built workflows around its document structure found the structure was stored in a format only Quip controlled. 
-Salesforce acquired Quip and deprioritized it; teams that had built workflows around its document structure found the investment worthless on migration because the structure was stored in a format only Quip controlled. That is not a product failure. It is the custody model working exactly as designed: the user's workflow lives on vendor infrastructure until it doesn't.
+When a vendor announces shutdown, it typically offers an export. What that export contains, what format it uses, and whether any other software can consume it are highly variable. For project management data, the export is typically a CSV of the task list — without comments, without attachment history, without the relationship structure that made the tool useful. For document collaboration, most platforms offer a PDF export, which preserves the appearance but none of the editability.
-The data export problem deserves specific attention. When a vendor announces shutdown, it typically offers an export function. What that export contains, what format it uses, and whether any other software can actually consume it are highly variable. For project management data, vendors typically export a CSV of the task list - without the comments, without the attachment history, without the relationship structure that made the tool useful. For document collaboration, most platforms offer a PDF export, which preserves the appearance but none of the editability.
-
-The legal firm whose vendor gets acquired faces this directly. They adopted the software, trained staff, integrated it with billing and document management workflows, and accumulated years of matter history. Now they evaluate whether to migrate to the acquirer's competing product under the acquirer's pricing, or start over with a third party, reconstructing what they can from a flat CSV and a folder of PDFs.
-
-The risk has a name that undersells it. *Vendor shutdown* sounds like a rare catastrophe. It is routine. 
Thousands of SaaS products shut down every year. Most are small enough that their shutdowns do not make news; their users find out through an email or a banner in the app. The shutdowns that do make news - Evernote's degraded state following years of ownership changes, Google Reader's abrupt termination in 2013 despite millions of active users, the steady stream of products acquired into enterprise platforms and starved of investment - are notable primarily because of the scale of the disruption, not because the pattern is unusual.
+The risk has a name that undersells it. *Vendor shutdown* sounds like a rare catastrophe. Thousands of SaaS products shut down every year. Most are small enough that their shutdowns don't make news; their users find out through an email or a banner. The shutdowns that do make news — Google Reader's termination in 2013 despite millions of active users, the steady stream of products acquired into enterprise platforms and starved of investment — are notable for scale, not for being unusual.
 ### The Connectivity That Wasn't There
-Not everyone's internet is always on - and this is consistently underweighted in the architecture of software sold to the industries where it most frequently fails.
-
-Construction sites operate at the edge of mobile coverage. A superintendent in a concrete frame building cannot get a signal three floors underground. Rural professional service firms - accounting firms in small towns, medical practices in counties with limited broadband, legal practices in areas where fiber has not reached - operate on connectivity that drops daily and fails entirely during weather events. Hospital clinical environments include zones where mobile devices are restricted near sensitive equipment. Air-gapped facilities - manufacturing, defense, government - cannot connect to any external network at all as a policy requirement.
+Construction sites operate at the edge of mobile coverage. 
A superintendent in a concrete frame building can't get a signal three floors underground. Rural professional service firms operate on connectivity that drops daily. Hospital clinical environments restrict wireless devices near sensitive equipment. Air-gapped facilities — manufacturing, defense, government — can't connect to any external network by policy.
 For these users, offline capability is not a feature request. It is the baseline requirement.
-The SaaS vendor's marketing page says "works on mobile," which is true when there is a signal. It does not say "works when there isn't one," because the centralized architecture makes that impossible without fundamental redesign. The application is a thin client rendering views from a remote database. Remove the remote database and the client has nothing to render.
+The SaaS vendor's marketing page says "works on mobile," which is true when there's a signal. The application is a thin client rendering views from a remote database. Remove the remote database and the client has nothing to render.
-Most SaaS platforms offer some form of "offline mode." What this means in practice is usually a read-only cache of recently viewed data, with form submissions that queue locally and attempt to upload when connectivity returns - with uncertain success rates and no visibility into what actually synced. You can view the last-synced version of a document. You cannot create new records, cannot run reports, cannot access data you have not recently viewed, and cannot have any confidence that what you submitted offline actually made it to the server.
+Most SaaS platforms offer some form of "offline mode." In practice this means a read-only cache of recently viewed data, with form submissions that queue locally and attempt upload when connectivity returns — with uncertain success rates and no visibility into what actually synced. You can view the last-synced version of a document. 
You cannot create new records, run reports, or access data you haven't recently viewed.
-The field operations manager who needs to log a safety inspection at seven in the morning on a construction site, before the crew starts work, has a few options when the SaaS is unreachable. Write it in a notebook and transcribe it later, with all the transcription errors that introduces. Use the app's read-only offline mode and hope the form submission queues correctly. Or skip the log and fill it in from memory when back in the office. All three options introduce risk. None of them should be necessary. The software should work on a construction site because that is where the work happens.
+Sabina Rahman is a microfinance loan officer for a Grameen-affiliated branch in rural northern Bangladesh. She covers eleven villages twice a week on a company motorbike, processing loan applications, KYC documentation, and repayment ledgers on a SaaS platform her bank standardized on the year of her hire. The platform is unreachable from her branch for an average of four hours a day.
-The mismatch extends beyond any single vertical. Reliable internet access is not universal, even in developed economies. Hospital clinical environments restrict wireless devices near sensitive equipment. Manufacturing and warehouse floors often have RF environments hostile to Wi-Fi. Agricultural operations span hundreds of acres - the field where something needs to be logged is rarely next to the fiber drop. Emergency response personnel work in exactly the places infrastructure fails first. For all of these workers, SaaS software's connectivity assumption is not an occasional inconvenience. It is a systematic design error applied to environments the designers never worked in.
+The day she stopped trusting it was a monsoon-relief disbursement morning. Forty-seven applicants in queue by 8:00 a.m. The platform took submissions until 11:14. Then it went down. 
Sabina processed the remaining nineteen applications by hand, into a carbon-copy ledger she called *shotti'r khata* — the truth book — with the borrowers' thumbprints on the carbons. The platform came back at 16:32. None of the nineteen hand-processed applications appeared in it. The bank's compliance system flagged them as missing; the audit team flagged her as the failure. It took six weeks to enter all nineteen retroactively, with documentation explaining why the timestamps didn't match the borrowers' submissions.
-Intermittent connectivity is not a US edge case. It is the global operational baseline. In Nigeria and South Africa, scheduled load-shedding cuts power for six to twelve hours daily; when electricity goes, routers and base stations go with it, and connectivity fails regardless of coverage quality. Hundreds of millions of enterprise workers in those economies plan their workdays around outage schedules, not around the assumption that the network is always available. In India, the 4G/3G/2G coverage gradient means that enterprise field operations - agricultural services, construction, financial services, healthcare - routinely run on intermittent connectivity across large portions of Tier 2 and Tier 3 cities and rural areas. Rural Brazil, rural Mexico, and most of Southeast Asia present comparable patterns at comparable scale. A SaaS platform that cannot function without a persistent connection does not have a niche offline problem. It has an architecture that excludes the majority of the world's enterprise users from full functionality.
+Tariq Hassan works the other end of the spectrum, where connectivity fails for different reasons. He is an offshore field engineer on a UAE-operated platform in the Persian Gulf, two hundred and forty kilometers off the coast of Abu Dhabi. The platform's primary uplink is a Ku-band satellite. When weather conditions degrade the satellite — on average twice a month — the platform falls to a microwave backup. 
When both links drop, the platform is offline.
-Sabina Rahman is one of those workers. She is a microfinance loan officer for a Grameen-affiliated branch in rural northern Bangladesh, in a Rangpur Division village forty kilometers from the nearest upazila headquarters; she covers eleven villages on a route she runs twice a week on a company motorbike. Her work is relationship banking the way it has been done in Bangladesh since 1976 - the year Muhammad Yunus made the first thirty loans of what would become Grameen Bank - and digital paperwork the way it has been done for the last decade. Loan applications, KYC documentation, repayment ledgers, monsoon-relief disbursements - all of it lives in a SaaS platform her bank standardized on the year of her hire. The platform is unreachable from her branch for an average of four hours a day. The mornings are the worst, when the entire upazila wakes up and pulls bandwidth at the same time.
+The day Tariq stopped trusting the cloud's ingestion pipeline was a six-hour double-link outage. The data buffered on the platform's local server. The uplinks returned. The buffer drained. The SaaS application the operator had standardized on was a thin client — it expected the data to be in the cloud already, and the ingestion pipeline rejected six hours of out-of-sequence data as malformed. The data was not lost. The onshore monitoring team was looking at the cloud, and the cloud was missing six hours of a drilling shift on a well that had cost the operator two hundred and ten million dollars to that point. Tariq spent the next ten days writing a manual reconciliation report.
-The day she stopped trusting the platform entirely was a monsoon-relief disbursement morning. Forty-seven applicants in queue at her branch by 8:00 a.m. The platform took submissions until 11:14. Then it went down. The applicants had taken half a day off from rice-paddy work to sit in the queue. 
Sabina processed the remaining nineteen applications by hand, into a carbon-copy ledger she had been keeping for two years and called *shotti'r khata* - the truth book - with the borrowers' thumbprints on the carbons and her own signature in blue ink. The platform came back at 16:32. None of the nineteen hand-processed applications appeared in it. The bank's compliance system flagged them as missing. The bank's audit team flagged her as the failure. It took six weeks to enter all nineteen retroactively, with documentation explaining why the timestamps did not match the borrowers' submissions.
-
-Sabina keeps a paper backup of every digital sign-off she has made since. Twelve years of binders. Grameen-style microfinance, she has been heard to say, teaches you not to trust networks you cannot see - the field officer carries the bank's reputation in her notebook because the village will trust the notebook longer than it will trust any vendor's uptime page.
-
-Tariq Hassan works the other end of the spectrum, where connectivity fails for opposite reasons. He is an offshore field engineer on a UAE-operated platform in the Persian Gulf, two hundred and forty kilometers off the coast of Abu Dhabi, one of nine Pakistani crew on a roster of forty-two. The platform's primary uplink is a Ku-band satellite. The backup is a microwave repeater on the next platform north. When weather conditions degrade the satellite - which happens on average twice a month and can last from forty minutes to fourteen hours - the platform falls back to the microwave. When the platform north is also degraded, both links drop and the platform is offline. Tariq's job is to keep the drilling-data feed running into the operator's onshore monitoring center in Dubai.
-
-The day Tariq stopped trusting the cloud's ingestion pipeline was a continuous double-link outage of just under six hours. The data buffered on the platform's local server. The uplinks returned. The buffer drained. 
The SaaS application the operator had standardized on the year Tariq was hired was a thin client - it expected the data to be in the cloud already, and the application's ingestion pipeline rejected six hours of out-of-sequence data as malformed. The data was not lost. It sat on the platform's local server for anyone who knew where to look. The onshore monitoring team was looking at the cloud, and the cloud was missing six hours of a drilling shift on a well that had cost the operator two hundred and ten million dollars to that point. Tariq spent the next ten days writing a manual reconciliation report that the SaaS vendor's account manager called "an inconvenience." Tariq called it something else, in Urdu, to a colleague who asked him later how the report had gone.
-
-Tariq learned to run a parallel local data capture in addition to the SaaS feed, on a laptop in his bunk that he had reformatted to a Linux distribution the platform's IT department was not aware existed. He never trusted cloud telemetry on the platforms after that. The practice did not fail him. He kept it.
+Intermittent connectivity is not a US edge case. Scheduled load-shedding in Nigeria and South Africa cuts power for six to twelve hours daily; connectivity fails with it. Hundreds of millions of enterprise workers plan their workdays around outage schedules, not around the assumption that the network is always on. A SaaS platform that can't function without a persistent connection doesn't have a niche offline problem — it has an architecture that excludes the majority of the world's enterprise users from full functionality.
 ### The Data You Can't Get Back
-Your vendor's terms of service say your data is yours. They are often technically correct - the vendor does not claim ownership of the content you create. What the terms of service do not address is *accessibility*.
-
-Data that you own but cannot retrieve is data you do not have.
+Your vendor's terms of service say your data is yours. 
They are often technically correct — the vendor doesn't claim ownership of the content you create. What the terms don't address is *accessibility*. -Four mechanisms make data inaccessible while it technically "belongs" to you. +Data you own but cannot retrieve is data you don't have. -Export rate limits are the first. Many platforms allow data export but rate-limit the export API (Application Programming Interface) to prevent bulk extraction. A legal firm with ten years of matter history attempting a bulk export may find that retrieving its own data at the permitted rate takes weeks. During that window, the firm remains dependent on the vendor's infrastructure to operate - which is, not coincidentally, exactly the position the vendor prefers it to be in. - -Proprietary formats are the second. The export is available, but in a format only the vendor's tools read well. Attachments export without their metadata. Comment threads export as flat text without threading structure. Custom fields export as raw column headers without the semantic context that made them useful. The data is present; the information it represented is partially lost. - -Feature-gated access is the third. Some platforms require paid subscriptions to access export features, or limit export to higher pricing tiers. Users on free or lower tiers discover that their data is portable only as long as they keep paying - which means it is not portable at all. - -Account closure timing is the fourth. When a user cancels a subscription, access typically ends when the billing period ends. A user who cancels on the first of the month with a billing cycle that ends on the fifteenth has fifteen days to export before the account closes. Miss that window - because you changed jobs, because the cancellation notice did not clearly state the deadline - and the data may be gone. +Four mechanisms make data inaccessible while it technically "belongs" to you. 
Export rate limits: many platforms allow data export but rate-limit the export API to prevent bulk extraction; a legal firm with ten years of matter history may find that retrieving its own data at the permitted rate takes weeks. Proprietary formats: the export is available, but in a format only the vendor's tools read well — comment threads export as flat text, custom fields export as raw headers without semantic context. Feature-gated access: some platforms require paid subscriptions to access export features, so portability is contingent on continued payment. Account closure timing: access ends when the billing period ends; miss the export window — because you changed jobs, because the notice was unclear — and the data may be gone. None of these are edge cases. They are the routine operational parameters of vendor-managed data. ### The Price That Changes After You've Committed -Switching costs in SaaS are high because users build workflows around software. Training, integrations, historical data, learned patterns - these represent real investments. Vendors know this. Pricing structures often reflect it. - -Pricing is competitive during the acquisition phase, when vendors are winning customers and competing on features and price. After adoption, when the switching cost is real and rising, pricing pressure relaxes. A company that adopted a project management platform at $8 per seat per month, built an organization-wide workflow on it over two years, and now faces a renewal at $18 per seat per month confronts a real calculation: pay the new rate, or absorb the migration cost. The migration cost is often large enough that the price increase wins. - -Feature paywalls move in one direction. Features available on a given tier at adoption are not guaranteed to remain there. The roadmap description from three years ago that listed a capability as "included on Professional" may not match the current pricing page. 
Users who built workflows on features they understood to be included sometimes discover those features now require the next tier up. +Switching costs in SaaS are high because users build workflows around software. Training, integrations, historical data, learned patterns — these represent real investments. Vendors know this. -The per-seat model creates structural pressure as teams grow. A ten-person team's annual SaaS bill is manageable. A fifty-person team's bill at the same per-seat rate is five times larger, and by the time a company has reached fifty people using a platform, the switching cost has compounded accordingly. Teams that grow into enterprise sizes often find that per-seat pricing which was attractive at ten seats has become a significant budget line that IT attempts to renegotiate - often without success, because leverage has shifted. +Pricing is competitive during acquisition, when vendors are winning customers. After adoption, when switching costs are real and rising, pricing pressure relaxes. A company that adopted a project management platform at $8 per seat per month and now faces renewal at $18 per seat confronts a real calculation: pay the new rate, or absorb the migration cost. The migration cost is often large enough that the price increase wins. -Mid-contract price changes are less common but not rare. Platform economics shift, investor pressure changes, the competitive landscape evolves. Users who committed workflows and data to a platform signed a contract of sorts - and then discovered the other party's interpretation of that contract differed from their own. +Feature paywalls move in one direction. Features available on a given tier at adoption are not guaranteed to remain there. Per-seat models create structural pressure as teams grow — a ten-person team's bill scales to five times that at fifty people, by which point the switching cost has compounded accordingly. 
-The lock-in compounds when teams use multiple SaaS products that integrate with each other. A project management platform connected to a communication tool, a file storage service, a time tracker, and a billing system creates a dependency web where each integration raises the switching cost of every other platform. When one vendor raises prices, the team is not evaluating that product in isolation - they are evaluating the cost of unwinding a set of integrations built over years. Integration ecosystems serve the vendor's retention objectives as reliably as they serve the user's productivity. The web of dependencies is not a side effect of the SaaS model. From the vendor's perspective, it is a feature of it. +Lock-in compounds when teams use multiple SaaS products that integrate with each other. A project management platform connected to a communication tool, a file storage service, a time tracker, and a billing system creates a dependency web where each integration raises the switching cost of every other platform. The web of dependencies is not a side effect of the SaaS model. From the vendor's perspective, it is a feature of it. ### The Drift You Don't See -The first five modes manifest visibly. The platform stops loading, the vendor announces a shutdown, the laptop loses connectivity, the export fails, the price doubles. The user notices because the work stops. +The first five modes manifest visibly. The platform stops loading, the vendor announces shutdown, the laptop loses connectivity, the export fails, the price doubles. The user notices because the work stops. -This one does not. The system continues to operate normally. Two users edit the same record on different devices, then a sync conflict resolves silently in favor of one set of changes; the other user's work is gone, but no error appears and no notification fires. 
A formula recomputes against stale upstream values, propagating a subtly wrong number through downstream cells; the dashboard reports green. A duplicate record gets created when a unique-key constraint fails to enforce across replicas; both records persist, both look authoritative, and the application logic that depended on uniqueness produces wrong results until someone notices the second copy. The work appears to continue. The output is wrong. +This one doesn't. Two users edit the same record on different devices; a sync conflict resolves silently in favor of one set of changes, the other user's work is gone, and no error fires. A formula recomputes against stale upstream values, propagating a subtly wrong number through downstream cells; the dashboard reports green. A duplicate record gets created when a unique-key constraint fails to enforce across replicas; both records persist, both look authoritative, and the logic that depended on uniqueness produces wrong results until someone notices the second copy. -Silent corruption and silent divergence are the failure modes the user catches last and trusts the system about most. Production engineering teams who have shipped collaborative SaaS describe these as the bugs they fear most: not the loud failures, but the quiet ones that surface only when a customer notices a number does not add up or a record they remember saving is no longer there. The architecture matters here because of where convergence is decided. SaaS resolves conflicts inside vendor infrastructure with no surfacing primitive; the user only learns about the resolution if it is wrong enough to notice. The architecture I argue for in the chapters that follow makes the convergence-or-divergence question first-class at the data layer rather than implicit in vendor behavior. 
+Silent corruption and silent divergence are the failure modes production engineering teams fear most: not the loud failures, but the quiet ones that surface only when a customer notices a number doesn't add up. SaaS resolves conflicts inside vendor infrastructure with no surfacing primitive; the user only learns about the resolution if it's wrong enough to notice. The architecture developed in later chapters makes the convergence question first-class at the data layer rather than implicit in vendor behavior. ### The Third-Party Veto -The first six failure modes originate inside the service relationship. The vendor fails, decides, prices, or quietly drifts. Both the vendor and the customer are subject to the same disruption, and in most cases neither party wanted it. +The first six failure modes originate inside the service relationship. The seventh does not. An external authority — a government, a regulator, a court — can restrict access regardless of what either party wants. The vendor has not failed. The customer has not been negligent. A third party with authority over one or both sides has acted. -The seventh does not. An external authority - a government, a regulator, a court - restricts access regardless of what either party wants. The vendor has not failed. The customer has not been negligent. A third party with authority over one or both sides of the relationship has acted, and the service relationship cannot continue. +In 2022, Western SaaS providers — Adobe, Autodesk, Microsoft, Figma, and dozens of others — suspended service across Russia and CIS markets under sanctions enforcement. Organizations across those markets, accounting for hundreds of thousands of seats built into workflows over more than a decade, found their operations interrupted not because their vendors failed them but because their vendors were directed to stop serving them. In February 2026, the US Defense Secretary designated Anthropic's AI services a national security supply-chain risk [1].
Federal agencies with active Anthropic deployments received direction to cease using them. Anthropic contested the designation legally [2], and a California court enjoined portions of the order for civilian agencies [3]. The Department of Defense exclusion stood [4]. Both Anthropic and its federal customers wanted to continue the relationship. Neither controlled the outcome. -The authority can act on the vendor. In 2022, Western SaaS providers - Adobe, Autodesk, Microsoft, Figma ([figma.com](https://www.figma.com/), the design tool), and dozens of others - suspended service across Russia and CIS (Commonwealth of Independent States) markets under sanctions enforcement; organizations across those markets, accounting for many hundreds of thousands of seats built into workflows over more than a decade, found their operations interrupted not because their vendors failed them but because their vendors were directed to stop serving them. Software that had been licensed, trained on, and integrated into operational workflows became inaccessible with days of notice, not months. In February 2026, the US Defense Secretary designated Anthropic's AI services a national security supply-chain risk [1]. Federal agencies with active Anthropic deployments - deployments they found valuable and wished to continue - received direction under executive order to cease using them. Anthropic contested the designation legally [2], and a California court subsequently enjoined portions of the order for civilian agencies [3]. The Department of Defense exclusion stood [4]. Both Anthropic and its federal customers wanted to continue the relationship. Neither controlled the outcome. The analytically significant detail in both cases: the restriction came from a party with authority over the vendor, independent of both the vendor's and the customer's preferences. +The authority can act on the customer instead. 
Russia's Federal Law 242-FZ has required since 2015 that personal data of Russian citizens be stored on servers located within Russia; organizations using Western SaaS found themselves structurally non-compliant not because their vendor did anything but because the SaaS architecture can't provide on-premises data residency by design. The European Court of Justice's 2020 Schrems II ruling constrained EU organizations from transferring personal data to US cloud providers without adequate supplemental safeguards. India's DPDP Act 2023 creates comparable obligations for Indian organizations using US-hosted services for Indian residents' personal data. -The authority can act on the customer. Russia's Federal Law 242-FZ - among the first general-purpose data localization laws globally, predating GDPR (General Data Protection Regulation) by two years - has required since 2015 that personal data of Russian citizens be stored on servers located within Russia; organizations using Western SaaS found themselves structurally non-compliant not because their vendor did anything but because the SaaS architecture cannot provide on-premises data residency by design. The European Court of Justice's 2020 Schrems II ruling constrained EU organizations from transferring personal data to US cloud providers without adequate supplemental safeguards - the vendor continued operating; the customer's legal ability to continue using it was constrained. India's DPDP (Digital Personal Data Protection) Act 2023 is now creating comparable obligations for Indian organizations using US-hosted services for Indian residents' personal data. In each case, the customer becomes non-compliant regardless of the vendor's preferences or actions. - -The structural property that makes this failure mode distinct: data custody determines exposure. Data in vendor infrastructure can be reached by a government action targeted at the vendor. 
Data on hardware the user controls requires action targeted specifically at the user. The architecture either concentrates that exposure surface at the vendor or distributes it. +The structural property that makes this failure mode distinct: data custody determines exposure. Data in vendor infrastructure can be reached by a government action targeted at the vendor. Data on hardware the user controls requires action targeted specifically at the user. --- ## The Work That Doesn't Stop -The seven failure modes above describe what breaks. The work itself continues - that is the part most cloud-dependency arguments miss. They reach for whatever still works. +The seven failure modes above describe what breaks. The work itself continues — that's the part most cloud-dependency arguments miss. Workers reach for whatever still works. -In February 2026, HBO Max's medical drama *The Pitt* devoted two consecutive episodes to this scenario. The fictional Pittsburgh Trauma Medical Center pre-emptively takes its electronic health record system offline after two nearby hospitals are hit with ransomware. What follows is recognizable to anyone who has lived through an actual EHR outage: dry-erase boards return to the nurses' station, a fax machine reappears at triage, paper prescription pads come out of the supply closet, and triplicate forms circulate among medical assistants who have never seen them before - felt-tip markers oblivious to the carbon backing, the bottom copies coming out blank. A senior nurse spends much of the episode correcting the younger staff on the conventions of an analog workflow they have only heard about in training. The trauma center keeps operating. The patients still get seen. The work does not stop. +In February 2026, HBO Max's medical drama *The Pitt* devoted two consecutive episodes to this scenario. 
The fictional Pittsburgh Trauma Medical Center pre-emptively takes its electronic health record system offline after two nearby hospitals are hit with ransomware. Dry-erase boards return to the nurses' station. Paper prescription pads come out of the supply closet. Triplicate forms circulate among medical assistants who have never seen them — felt-tip markers oblivious to the carbon backing. The trauma center keeps operating. The patients get seen. The work doesn't stop. The episode is fiction. The pattern is not. Maria Santos lived it. -Maria was the IT operations administrator at a 312-bed teaching hospital in Belo Horizonte the morning the ransomware hit. She was three hours into her shift, sitting in her office with a coffee that had gone cold during the second of two morning standups, when the help-desk queue lit up. By 9:14 the EHR was unavailable system-wide. By 9:21 the radiology PACS was unreachable. By 9:30 she was in the CIO's office watching him try to reach the vendor's emergency line and getting an automated message that confirmed only that the vendor was aware of "an incident affecting multiple customers." - -The hospital had forty-seven patients in the OR queue that morning. The list of who was scheduled for what existed in the EHR. Without the EHR, the list existed in the heads of the nurses who had been reading it at 7 a.m. before everything went dark. Maria spent the next eleven hours doing things hospital administrators are not supposed to have to do. She walked the floor with a clipboard. She watched the triage nurses recreate patient acuity ratings on dry-erase boards. She stood next to a charge nurse who was trying to remember whether a man in Bay 4 had a sulfa allergy or a penicillin allergy because his chart was on a server that would not respond. She made eight phone calls that morning that ended with sentences she will not say again. 
*I don't know yet.* *We're working on it.* *I will call you when I have something to tell you.* - -The vendor restored access seventy-three hours later. The hospital had not lost a patient. Several other hospitals in the same vendor's customer base, hit the same week, had. Maria does not know what those hospitals' administrators were doing during their seventy-three hours and she does not need to know. She knows the shape of those hours from inside. +Maria was the IT operations administrator at a 312-bed teaching hospital in Belo Horizonte the morning the ransomware hit. By 9:14 the EHR was unavailable system-wide. By 9:21 the radiology PACS was unreachable. The hospital had forty-seven patients in the OR queue. Without the EHR, that list existed in the heads of the nurses who had read it at 7 a.m. Maria spent the next eleven hours walking the floor with a clipboard, watching triage nurses recreate patient acuity ratings on dry-erase boards, standing next to a charge nurse trying to remember whether a man in Bay 4 had a sulfa allergy or a penicillin allergy because his chart was on a server that wouldn't respond. -She still checks every clinical-data record three times before she signs off on a handoff. Once is procedure. Three times is what she carries from the morning she could not tell a charge nurse whether a man's chart said sulfa or penicillin. +The vendor restored access seventy-three hours later. The hospital had not lost a patient. Maria still checks every clinical-data record three times before she signs off on a handoff. Once is procedure. Three times is what she carries from the morning she couldn't tell a charge nurse whether a man's chart said sulfa or penicillin. -Healthcare ransomware incidents are tracked publicly by trackers including Comparitech, the HIPAA Journal, and the HHS OCR breach portal, and the count of US hospital ransomware events has run into the hundreds per year for several years now. 
Healthcare-services research has consistently associated ransomware-driven EHR downtime with elevated patient-harm metrics - the specific magnitudes vary by study and by the size of the disruption window. Healthcare professionals interviewed about *The Pitt* identified the same artifacts in their own incident logs: paper charts piling up at the nurses' station, prescriptions written by hand, hours of post-restoration overtime to back-fill the EHR with what happened on paper while the system was offline. The on-screen chaos is not exaggerated. It is documentary realism dressed as drama. +Healthcare ransomware incidents have run into the hundreds per year for several years. Healthcare-services research consistently associates EHR downtime with elevated patient-harm metrics. The on-screen chaos in *The Pitt* is not exaggerated — it is documentary realism dressed as drama. -Two observations matter for any architecture decision. First: the work continued because human practitioners knew what to do without the digital system. Triage worked. Charting worked. Billing eventually caught up. Domain expertise outlasts the software that depends on it. Second: the digital affordances did not survive. Search disappeared. Cross-shift handoff slowed to verbal report. Pattern detection across patient histories - the analytic work that justified the EHR investment in the first place - became impossible until the system came back. The organization's ability to *do* the work survived. Its ability to do the work *better than paper* did not. +Two observations drive every architecture decision that follows. First: the work continued because human practitioners knew what to do without the digital system. Domain expertise outlasts the software that depends on it. Second: the digital affordances didn't survive. Search disappeared. Pattern detection across patient histories — the analytic work that justified the EHR investment — became impossible until the system came back. 
The organization's ability to *do* the work survived. Its ability to do the work *better than paper* did not. -The same pattern repeats outside the hospital. When the SaaS project management platform goes down, the construction office runs on whiteboards and printed change-order forms. When the SaaS legal-research platform is unreachable, the law firm sends an associate to the print library. When the SaaS field-service application fails, the technician carries a paper work order and reconciles in the system the next day. None of these workarounds are the failure of the people. They are the *resilience* of the people. They are also a measurement of how much value the SaaS layer was adding versus how much it was simply mediating. +When the SaaS project management platform goes down, the construction office runs on whiteboards and printed change-order forms. When the SaaS legal-research platform is unreachable, the law firm sends an associate to the print library. None of these workarounds are the failure of the people. They are the *resilience* of the people. They are also a measurement of how much value the SaaS layer was adding versus how much it was simply mediating. -This is the gap the inverted stack closes. A SaaS outage takes everything digital with it; a local-first node holds the digital affordances on the device the practitioner is already using. The drawer of paper backup forms remains in the supply closet - every hospital should have one, every law firm should have one, every construction office should have one - but the drawer becomes a true backup rather than the only operating mode. When the network returns, the local node syncs. The post-incident overtime drops from days to minutes. The patient-harm signature of EHR downtime becomes a statistic about an architecture that the next generation of systems was designed to replace. That is the empirical case this dissertation builds. +This is the gap the inverted stack closes. 
A SaaS outage takes everything digital with it; a local-first node holds the digital affordances on the device the practitioner is already using. The drawer of paper backup forms remains in the supply closet — but the drawer becomes a true backup rather than the only operating mode. When the network returns, the local node syncs. The post-incident overtime drops from days to minutes. --- ## Who Pays the Most -These seven failure modes do not hit every organization equally. The organizations most exposed share a characteristic: they have the least structural leverage to address any of them. +These seven failure modes don't hit every organization equally. The most exposed share a characteristic: they have the least structural leverage to address any of them. -A large enterprise with a skilled procurement and IT organization can negotiate. Data portability clauses, SLAs with financial penalties, escrow provisions for source code and data - these are available to buyers with enough revenue to make the vendor's legal team engage seriously. When the vendor gets acquired, the enterprise has attorneys who can enforce contract terms or negotiate exit conditions. +A large enterprise with a skilled procurement team can negotiate. Data portability clauses, SLAs with financial penalties, escrow provisions for source code and data — these are available to buyers with enough revenue to make the vendor's legal team engage. When the vendor gets acquired, the enterprise has attorneys who can enforce contract terms. -Small and medium-sized professional service firms do not have this leverage. The legal practice with eight attorneys signs up through a website. The medical group with four physicians clicks through a terms of service that nobody reads. The construction firm with two project managers pays by credit card. Their vendor contract is the standard terms of service, unmodified. They have no SLA. They have no escrow. They have no explicit data portability requirement. 
If the vendor changes pricing, those users have no mechanism to object. If the vendor shuts down, they have whatever the shutdown announcement says they have. +Small and medium-sized professional service firms don't have this leverage. The legal practice with eight attorneys signs up through a website. The medical group with four physicians clicks through terms of service nobody reads. The construction firm with two project managers pays by credit card. Their vendor contract is the standard terms of service, unmodified — no SLA, no escrow, no explicit data portability requirement. -These are also the organizations where software failures have direct professional consequences rather than just operational inconvenience. The construction PM missing a bid deadline loses the bid - and damages the relationship with the client. The legal practice unable to access case files has a professional responsibility exposure. The medical practice that cannot retrieve patient records has regulatory risk. The stakes of availability are not abstract. +These are also the organizations where software failures have direct professional consequences rather than just operational inconvenience. The construction PM missing a bid deadline loses the bid and damages the client relationship. The legal practice unable to access case files has professional responsibility exposure. The medical practice that can't retrieve patient records has regulatory risk. The stakes of availability are not abstract. -And these organizations are the primary addressable market for the products most likely to carry the SaaS risks described above. The large enterprise with the IT team and the procurement counsel is using enterprise-licensed software with negotiated protections. The eight-attorney law firm is using the same product tier as the freelancer, under the same standard terms, with the same structural exposure to every failure mode described in this chapter. 
+And these organizations are the primary addressable market for the products most likely to carry the SaaS risks described above. The large enterprise with the IT team and procurement counsel uses enterprise-licensed software with negotiated protections. The eight-attorney law firm uses the same product tier as the freelancer, under the same standard terms, with the same structural exposure to every failure mode in this chapter. -This is not a coincidence. The SaaS bundle packages its desirable and undesirable properties together in a way that affects smaller buyers more severely, because smaller buyers have less ability to negotiate the undesirable half away. +This is not a coincidence. The SaaS bundle packages its desirable and undesirable properties in a way that affects smaller buyers more severely, because smaller buyers have less ability to negotiate the undesirable half away. -The regulatory dimension compounds this asymmetry. A legal practice storing confidential client communications in a vendor's cloud carries a professional duty to understand where that data lives and who can access it. A medical practice has HIPAA (Health Insurance Portability and Accountability Act) obligations. A construction firm with government contracts may have data residency requirements tied to those contracts. For large enterprises, these obligations get negotiated into vendor agreements with audit rights and data processing addenda. For the eight-attorney firm, the compliance answer is the vendor's standard privacy policy - a document written to protect the vendor, not the client. +The regulatory dimension compounds this. A legal practice storing client communications in a vendor's cloud carries a professional duty to understand where that data lives. A medical practice has HIPAA obligations. For large enterprises, these get negotiated into vendor agreements with audit rights and data processing addenda. 
For the eight-attorney firm, the compliance answer is the vendor's standard privacy policy — a document written to protect the vendor, not the client. -The jurisdictional scope of this compliance argument is wider than US-centric discussions typically acknowledge. The EU's Schrems II ruling, India's Digital Personal Data Protection Act 2023, the UAE's DIFC (Dubai International Financial Centre) Data Protection Law 2020, China's Personal Information Protection Law (PIPL, 2021), Brazil's LGPD (Lei Geral de Proteção de Dados, 2018), South Africa's POPIA (Protection of Personal Information Act, 2013), Nigeria's NDPR (Nigeria Data Protection Regulation, 2019), Japan's APPI (Act on the Protection of Personal Information), South Korea's PIPA (Personal Information Protection Act), and Russia's Federal Law 242-FZ are representative - each, in different language, makes data residency a compliance mechanism rather than a preference. The same pattern repeats across more than thirty national and regional frameworks; the full coverage table for this chapter is in Appendix F. In each of these jurisdictions, an architecture where data lives on the user's own hardware - not in a vendor's cloud region - is not merely preferred. In many configurations, it is the architecture that makes compliance tractable. The architecture I propose is frequently a legal requirement before it is an architectural choice. +The jurisdictional scope is wider than US-centric discussions acknowledge. The EU's Schrems II ruling, India's DPDP Act 2023, China's PIPL (2021), Brazil's LGPD (2018), South Africa's POPIA (2013), Nigeria's NDPR (2019), and Russia's Federal Law 242-FZ each make data residency a compliance mechanism rather than a preference. The full coverage table is in Appendix F. In each of these jurisdictions, an architecture where data lives on the user's own hardware is not merely preferred — in many configurations it is the architecture that makes compliance tractable. 
--- ## Why Users Have Accepted This -Until recently, they did not have a choice. +Until recently, they didn't have a choice. -Real-time collaboration requires that all parties see consistent state when they make concurrent changes. In 2008, the most practical way to guarantee this was a central server both parties could read from and write to simultaneously. Every other approach - emailing files, shared drives, version control - introduced either merge conflicts requiring manual resolution or coordination overhead requiring explicit locking. Real-time collaboration solved both problems by making divergence impossible: one copy, everyone editing the same one. +Real-time collaboration required a central server both parties could read from and write to simultaneously. Every other approach — emailing files, shared drives, version control — introduced merge conflicts requiring manual resolution or coordination overhead requiring explicit locking. One copy, everyone editing the same one, solved both. -Multi-device sync requires an authoritative copy that all devices agree on. When the cloud holds the authoritative copy, sync is the cloud pushing updates to each device. Without a cloud authority, devices have to figure out among themselves which version is current - and the consumer-grade protocols for resolving concurrent edits across devices reliably, at scale, without requiring user intervention, did not exist. Merging concurrent edits deterministically, without a server to adjudicate conflicts, was an unsolved problem for end-user software. +Multi-device sync required an authoritative copy that all devices agreed on. Without a cloud authority, devices had to figure out among themselves which version was current — and the consumer-grade protocols for resolving concurrent edits across devices reliably, at scale, without user intervention didn't exist. -Zero maintenance requires that someone else manage the infrastructure. 
The alternative is the user managing it, which requires IT capability that most small organizations do not have and do not want to develop. The comparison to self-hosted software circa 2005 is instructive: a self-hosted email server, a self-hosted project tracker, a self-hosted document collaboration platform - all theoretically possible, all practically demanding enough that most organizations paid someone else to handle it. - -The dependencies looked structural because they were structural. The technology for delivering these properties without vendor infrastructure either did not exist or was not mature enough to deploy without specialized expertise. CRDTs (Conflict-free Replicated Data Types) were academic research with a handful of experimental implementations. Gossip protocols ran inside distributed databases; nobody was building them into end-user applications. Container runtimes existed for server workloads; the packaged, embeddable, consumer-invisible form that makes Docker Desktop run silently on your laptop had not been built. +Zero maintenance required that someone else manage the infrastructure. The comparison to self-hosted software circa 2005 is instructive: a self-hosted email server, a self-hosted project tracker — all theoretically possible, all practically demanding enough that most organizations paid someone else. Users accepted the SaaS bundle not because they preferred the conditions on the second half but because the technology of the time made those conditions appear to be the cost of the first half. They were not accepting a bargain so much as acknowledging a constraint. -The constraint is removable - by the architecture this dissertation proposes. - -The evidence is commercial, not theoretical. The earliest and most consequential proof is African mobile money: M-PESA has processed financial transactions for hundreds of millions of users across East Africa since 2007; MTN MoMo operates at comparable scale across dozens of African markets. 
Both are built on offline-tolerant transaction patterns - store-and-forward reconciliation, intermittent-network authorization, operational continuity through connectivity gaps - because the networks they run on require it. Local-first architecture is not a new idea awaiting adoption; it has operated at population scale for nearly two decades in the markets that most benefit from it. +The constraint is removable. -In the professional software space, Linear ([linear.app](https://linear.app/), the issue tracker) demonstrates that a sync engine can run locally even inside a SaaS architecture - clients keep a local SQLite replica, and the cloud is demoted to a relay peer for the engine layer. Authoritative data still lives on Linear's servers; the architecture I argue for takes the next step. Figma is often cited in the same breath because Figma uses CRDT-flavored mechanisms for multiplayer cursor coordination - but Figma's data lives on Figma's servers and the local client is not authoritative; Figma is a collaboration win, not a data-sovereignty architecture. Actual Budget delivers full personal finance capability with the user's data on local storage and the sync service optional, with no vendor data custody required. Anytype extends the pattern with end-to-end encrypted sync over user-controlled storage. +The evidence is commercial, not theoretical. M-PESA has processed financial transactions for hundreds of millions of users across East Africa since 2007; MTN MoMo operates at comparable scale across dozens of African markets. Both are built on offline-tolerant transaction patterns — store-and-forward reconciliation, intermittent-network authorization, operational continuity through connectivity gaps — because the networks they run on require it. Local-first architecture is not a new idea awaiting adoption; it has operated at population scale for nearly two decades in the markets that most benefit from it. 
-These products demonstrate that the desirable half of the SaaS bundle - collaboration, sync, responsive UI - does not require vendor data custody to function. Users who have worked with software built on these foundations know what it feels like when software keeps running after the internet goes out. The acceptance erodes when the alternative is observable, not theoretical. +In professional software, Linear demonstrates that a sync engine can run locally even inside a SaaS architecture — clients keep a local SQLite replica, and the cloud is demoted to a relay peer. Actual Budget delivers full personal finance capability with the user's data on local storage and the sync service optional. Anytype extends the pattern with end-to-end encrypted sync over user-controlled storage. These products demonstrate that the desirable half of the SaaS bundle — collaboration, sync, responsive UI — doesn't require vendor data custody to function. --- ## The Dependency That Looks Inevitable -Three independent technology shifts removed the structural necessity of the SaaS bundle: CRDTs (Conflict-free Replicated Data Types) in production at Linear, Automerge, Yjs, and Actual Budget; leaderless replication at the edge (the same family of protocols Cassandra and DynamoDB use at planetary scale, applied without modification at five-machine team scale); and the local-service pattern that tools like VS Code language servers, Docker Desktop, and Tailscale made invisible to users. Each shift solved a problem unrelated to the SaaS bundle. The consequence - that the technical reasons SaaS architectures had to concentrate data at the vendor are gone - followed from those solutions. Chapter 2 develops each in full. 
+Three independent technology shifts removed the structural necessity of the SaaS bundle: CRDTs (Conflict-free Replicated Data Types) in production at Linear, Automerge, Yjs, and Actual Budget; leaderless replication at the edge — the same family of protocols Cassandra and DynamoDB use at planetary scale, applied at five-machine team scale; and the local-service pattern that tools like VS Code language servers, Docker Desktop, and Tailscale made invisible to users. Each shift solved a problem unrelated to the SaaS bundle. The consequence — that the technical reasons SaaS architectures had to concentrate data at the vendor are gone — followed from those solutions. Chapter 2 develops each in full. -The architecture this dissertation proposes has real costs. They do not disappear; they move. Software that ships to user-controlled hardware needs a helpdesk model, software-bill-of-materials discipline, patch cadence, key custody, schema migration across independently upgrading nodes, and operational telemetry from machines the operator does not own. Part III specifies the architecture that absorbs those commitments. Part IV specifies the playbooks that ship and operate it. The trade is vendor dependency for operational discipline. Most readers will conclude the trade is worth making for workloads where data sovereignty, regulatory exposure, or operational continuity rule out the SaaS bundle. Some will not. Chapter 4 helps you decide. +The architecture this book proposes has real costs. They don't disappear; they move. Software that ships to user-controlled hardware needs a helpdesk model, software-bill-of-materials discipline, patch cadence, key custody, schema migration across independently upgrading nodes, and operational telemetry from machines the operator doesn't own. Part III specifies the architecture that absorbs those commitments. Part IV specifies the playbooks that ship and operate it. The trade is vendor dependency for operational discipline. 
Most readers will conclude the trade is worth making for workloads where data sovereignty, regulatory exposure, or operational continuity rule out the SaaS bundle. Chapter 4 helps you decide. -Marcus's scenario - deadline-critical work held hostage by infrastructure he does not control - is the failure mode this architecture addresses first. His data was never gone. It was inaccessible because the software's design placed it somewhere he could not reach. The remaining chapters specify a design where that distinction does not exist. +Sunita's scenario — deadline-critical work held hostage by infrastructure she doesn't control — is the failure mode this architecture addresses first. Her data was never gone. It was inaccessible because the software's design placed it somewhere she couldn't reach. The remaining chapters specify a design where that distinction doesn't exist. -The building blocks are production-proven. What remains is the specific assembly that produces a node - not a smarter cache, not a thicker client, but a first-class local peer that behaves like a cloud application, passes enterprise security review, and treats user data ownership as a structural guarantee rather than a contractual one. Chapter 2 identifies exactly what that requires and where the existing work stops short. Chapter 3 draws the node. +The building blocks are production-proven. What remains is the specific assembly that produces a node — not a smarter cache, not a thicker client, but a first-class local peer that behaves like a cloud application, passes enterprise security review, and treats user data ownership as a structural guarantee rather than a contractual one. Chapter 2 identifies exactly what that requires and where the existing work stops short. Chapter 3 draws the node. 
--- diff --git a/vol-1/part-1-thesis-and-pain/ch02-local-first-serious-stack.md b/vol-1/part-1-thesis-and-pain/ch02-local-first-serious-stack.md index 20e27ae..4e9c5fa 100644 --- a/vol-1/part-1-thesis-and-pain/ch02-local-first-serious-stack.md +++ b/vol-1/part-1-thesis-and-pain/ch02-local-first-serious-stack.md @@ -1,89 +1,87 @@ # Chapter 2 - Local-First: From Sync Toy to Serious Stack - + --- -In 2019, researchers at Ink & Switch posed a hypothesis they called local-first software [1]. The question was structural, not legal. What would it take for software to keep your data on your machine, sync it when convenient, and refuse to stop working the moment a vendor server fails or a company changes its business model? They proposed an answer in seven properties - a testable definition the field could use to separate what counts from what merely calls itself local-first. +In 2019, researchers at Ink & Switch posed a hypothesis they called local-first software [1]. The question was structural, not legal. What would it take for software to keep your data on your machine, sync it when convenient, and refuse to stop working the moment a vendor server fails or a company changes its business model? They proposed an answer in seven properties — a testable definition the field could use to separate what counts from what merely calls itself local-first. -The seven properties expose exactly where every existing attempt falls short - including the best commercial ones. Getting to all seven requires more than clever sync. It requires running a complete application stack at the edge, not a smarter cache of someone else's database. +The seven properties expose exactly where every existing attempt falls short, including the best commercial ones. Getting to all seven requires more than clever sync. It requires running a complete application stack at the edge, not a smarter cache of someone else's database. -The word "serious" in this chapter's title is not a claim about complexity. 
It is a claim about scope. A sync toy satisfies one or two of the seven properties and defers the hard ones. A serious stack satisfies all seven. And it adds what the ideals paper did not. The deployment model. The security model. The governance model. The migration story. The path to commercial sustainability. **The composition is the contribution** - not the individual components, which are all production-proven somewhere, but the assembly that lets them be one system. +The word "serious" in this chapter's title is not a claim about complexity. It is a claim about scope. A sync toy satisfies one or two of the seven properties and defers the hard ones. A serious stack satisfies all seven — and adds what the ideals paper did not: a deployment model, a security model, a governance model, a migration story, and a path to commercial sustainability. **The composition is the contribution** — not the individual components, which are all production-proven somewhere, but the assembly that lets them function as one system. --- ## The Seven Ideals -The seven properties from Kleppmann et al. [1] are not a wishlist. They are a minimum bar - a filter calibrated to fail anything that approximates local-first without actually being it. Most apps pass two or three. Almost nothing passes all seven. The ones that fail are instructive, because they fail in the same places, for the same reasons. +The seven properties from Kleppmann et al. [1] are not a wishlist. They are a minimum bar — a filter calibrated to fail anything that approximates local-first without actually being it. Most apps pass two or three. Almost nothing passes all seven. The ones that fail are instructive, because they fail in the same places, for the same reasons. -**No spinners, no waiting.** The software responds instantly because it reads from local state, not from a network request. In practice, most apps fail this for anything beyond trivial reads. 
A project management tool that must phone home to load the task list fails the property during the first round-trip. It fails permanently when the network is gone. +**No spinners, no waiting.** The software responds instantly because it reads from local state, not from a network request. In practice, most apps fail this for anything beyond trivial reads. A project management tool that phones home to load the task list fails the property during the first round-trip and fails permanently when the network is gone. -**Work is not trapped on one device.** Your data on your laptop should be your data on your desktop, your tablet, your colleague's machine. Sync across devices and across people - not as a feature behind a subscription upgrade, but as a structural property. Apps that sync through a vendor's servers pass the property only while the vendor exists and the subscription is paid. When either condition ends, the data is trapped. +**Work is not trapped on one device.** Data on a laptop should be data on a desktop, a tablet, a colleague's machine. Sync across devices and across people — not as a feature behind a subscription upgrade, but as a structural property. Apps that sync through a vendor's servers pass the property only while the vendor exists and the subscription is paid. When either condition ends, the data is trapped. -**The network is optional.** Not "the network is preferred." Not "reduced functionality offline." Optional means the full application works without any network connection, indefinitely, and then syncs when a connection becomes available. This eliminates every app whose read path hits a remote API (Application Programming Interface). It eliminates every app whose write path queues locally and waits. Real offline requires that the local node hold an authoritative copy of data it is allowed to act on. +**The network is optional.** Not "the network is preferred." Not "reduced functionality offline." 
Optional means the full application works without any network connection, indefinitely, then syncs when a connection becomes available. This eliminates every app whose read path hits a remote API and every app whose write path queues locally and waits. Real offline requires the local node to hold an authoritative copy of data it is allowed to act on. -**Seamless collaboration.** Multiple people should be able to edit the same data simultaneously - without explicit locking, without "checkout" workflows, without a person designated to resolve conflicts manually. This is the property that made centralized servers feel necessary. If two people are writing concurrently, something has to decide the order. CRDTs (Conflict-free Replicated Data Types) provide the mathematical alternative: merge semantics that guarantee convergence without a coordinator. Software that requires a server to adjudicate concurrent writes fails this property the moment the server is unreachable. +**Seamless collaboration.** Multiple people should edit the same data simultaneously — without explicit locking, without checkout workflows, without a person designated to resolve conflicts manually. CRDTs (Conflict-free Replicated Data Types) provide the mathematical alternative: merge semantics that guarantee convergence without a coordinator. Software that requires a server to adjudicate concurrent writes fails this property the moment the server is unreachable. -**The long now.** Your data should outlive the vendor, the subscription, the company's strategic priorities, and the political conditions under which the service operates. A user who adopted Sunrise Calendar built workflows on it. When Microsoft shut it down in 2016, those workflows had an expiry date the user did not know about. A more recent and more consequential demonstration came in 2022. Adobe suspended service across Russia and CIS (Commonwealth of Independent States) markets under sanctions enforcement [10]. 
Autodesk suspended commercial activities in Russia [12]. Microsoft suspended new sales of products and services in Russia [13]. Figma ([figma.com](https://www.figma.com/), the design tool) blocked Russia-based users in compliance with US sanctions [11]. Dozens of other Western SaaS (Software as a Service) providers followed. Hundreds of thousands of organizations that had built operational workflows on those platforms over more than a decade lost access with days of notice. The long now means data in an open format, stored on user-controlled hardware, remains accessible regardless of what happens to the company that made the tool - or the jurisdiction the company operates in. Proprietary sync formats - even sync formats that feel invisible - fail this property. +**The long now.** Data should outlive the vendor, the subscription, the company's strategic priorities, and the political conditions under which the service operates. A user who adopted Sunrise Calendar built workflows on it. When Microsoft shut it down in 2016, those workflows had an expiry date the user did not know about. In 2022, Adobe suspended service across Russia and CIS markets under sanctions enforcement [10]. Autodesk suspended commercial activities in Russia [12]. Microsoft suspended new sales of products and services in Russia [13]. Figma blocked Russia-based users in compliance with US sanctions [11]. Hundreds of thousands of organizations that had built operational workflows on those platforms over more than a decade lost access with days of notice. The long now means data in an open format, stored on user-controlled hardware, remains accessible regardless of what happens to the company that made the tool — or the jurisdiction the company operates in. -**Security and privacy by default.** Data that lives locally is harder to breach at scale. A centralized database is a target; exfiltrating it compromises every user simultaneously. 
Distributed local stores raise the cost of attack - an adversary who compromises one node gets one user's data, not all users' data. Local storage without encryption creates a different problem: physical access to the device is sufficient. Security by default means end-to-end encryption at rest and in transit, with key control in the user's hands, not the vendor's. A distinct threat model applies in jurisdictions where cloud-hosted infrastructure is subject to mandatory government access requirements: architectures where keys never leave the user's device address a compliance constraint that cloud storage cannot satisfy architecturally, regardless of the vendor's intent. A local app that stores data in plaintext fails this property as badly as a cloud app does. +**Security and privacy by default.** Data that lives locally is harder to breach at scale. A centralized database is a target; exfiltrating it compromises every user simultaneously. Distributed local stores raise the cost of attack — an adversary who compromises one node gets one user's data, not all users' data. Security by default means end-to-end encryption at rest and in transit, with key control in the user's hands, not the vendor's. In jurisdictions where cloud-hosted infrastructure is subject to mandatory government access requirements, architectures where keys never leave the user's device address a compliance constraint that cloud storage cannot satisfy architecturally, regardless of the vendor's intent. -**You retain ultimate ownership and control.** The user decides where the data lives, who can access it, and when to delete it. This is not a contractual guarantee. It is a structural one. The bits live on hardware the user controls, in a format the user can read, under encryption the user can manage. Ownership conveyed only through a contract is ownership that can be revoked when the contract changes. 
+**You retain ultimate ownership and control.** The user decides where the data lives, who can access it, and when to delete it. This is not a contractual guarantee — it is a structural one. The bits live on hardware the user controls, in a format the user can read, under encryption the user can manage. Ownership conveyed only through a contract is ownership that can be revoked when the contract changes. -Seven properties. Together they describe software that works for the user independent of vendor survival, vendor pricing, and vendor infrastructure. To Kleppmann et al.'s knowledge at time of writing, no production app satisfied all seven. The closest candidate is Anytype, which satisfies five - CRDT (Conflict-free Replicated Data Type)-based collaboration and zero-knowledge encryption by default - but falls short on the long now (its full-fidelity export uses a proprietary Any-Block format no competing app reads natively) and on ultimate ownership (the application layer is "source available," not open-source; structural vendor independence depends on a contractual arrangement with the Any Association, not the architecture alone). Kleppmann himself no longer treats the seven as a binary checklist. At Local-First Conf 2024 he acknowledged the properties form "a gradient" rather than a pass-or-fail definition [3]. The seven remain the most rigorous available filter. No production app has cleared them all. +Together, the seven properties describe software that works for the user independent of vendor survival, vendor pricing, and vendor infrastructure. To Kleppmann et al.'s knowledge at time of writing, no production app satisfied all seven. At Local-First Conf 2024, Kleppmann acknowledged the properties form "a gradient" rather than a pass-or-fail definition [3]. The seven remain the most rigorous available filter. --- ## What Exists Today: A Taxonomy of Local-First Attempts -The local-first community has produced serious work. The apps below are not failures. 
They are the best commercial implementations of local-first thinking available. Their limitations are not oversights. They are the boundary where local-first principles meet the practical difficulty of running a full application stack at the edge. +The local-first community has produced serious work. The apps below are not failures — they are the best commercial implementations of local-first thinking available. Their limitations are not oversights. They are the boundary where local-first principles meet the practical difficulty of running a full application stack at the edge. ### The Document Sync Apps (Obsidian, Notion) -Obsidian stores notes as plain markdown files on your local filesystem. This is a genuinely correct choice. Plain text in an open format, on your own storage, is the most durable data model available. No import problem, no export problem, no proprietary encoding. If Obsidian disappears tomorrow, the files remain and every text editor on the planet reads them. The long-now property is satisfied by the data format alone. +Obsidian stores notes as plain markdown files on a local filesystem. Plain text in an open format, on user-controlled storage, is the most durable data model available. No import problem, no export problem, no proprietary encoding. If Obsidian disappears, the files remain and every text editor reads them. The long-now property is satisfied by the data format alone. -Where Obsidian stops is structured data and collaboration. Markdown files have a limited conflict resolution strategy: when two devices modify the same file concurrently, Obsidian's sync service attempts a line-level text merge for plain markdown but falls back to a conflict copy when merging fails or for non-text files. The conflict copy sits alongside the original. Resolution is manual. For a solo note-taker, this is an infrequent and tolerable annoyance. 
For a team using shared notes to track client work, project status, or decisions - where concurrent edits are the norm - the duplicate-file model fails. Obsidian's sync has no CRDT underneath it. The conflict strategy is to tell the user a conflict exists and let them figure it out. +Where Obsidian stops is structured data and collaboration. When two devices modify the same file concurrently, Obsidian's sync service attempts a line-level text merge for plain markdown but falls back to a conflict copy when merging fails or for non-text files. The conflict copy sits alongside the original; resolution is manual. For a solo note-taker, this is an infrequent and tolerable annoyance. For a team using shared notes to track client work, project status, or decisions — where concurrent edits are the norm — the duplicate-file model fails. Obsidian's sync has no CRDT underneath it. The conflict strategy is to tell the user a conflict exists and let them figure it out. -The deeper limitation is scope. Markdown files have no relational structure, no queryable schema, no concept of record types that relate to each other. A project has tasks. A task has a status, an assignee, a due date, subtasks, comments, and attachments. None of that fits in a flat text file without inventing a convention, and no two Obsidian users will invent the same convention. The moment a team needs structured data - not documents, but records - Obsidian's model breaks down. It is a document tool that happens to sync, not a structured-data tool with local-first properties. +The deeper limitation is scope. Markdown files have no relational structure, no queryable schema, no concept of record types that relate to each other. A project has tasks. A task has a status, an assignee, a due date, subtasks, comments, and attachments. None of that fits in a flat text file without inventing a convention, and no two Obsidian users will invent the same one. 
The moment a team needs structured data — not documents, but records — Obsidian's model breaks down. -Notion presents the inverse problem. It has structured data: databases, filtered views, linked records, formulas. But it is architecturally a web application with a rich offline cache. The authoritative copy remains on Notion's servers throughout. Concurrent edits go through those servers, which hold the authoritative copy. The long-now property fails immediately. Notion data lives in Notion's proprietary format, on Notion's servers, accessible only through Notion's application. An export produces a ZIP archive of markdown files and CSVs - a representation, not a migration. The relational structure, the filters, the formulas, the comment threads - none of these export faithfully to a format another application understands. +Notion presents the inverse problem. It has structured data: databases, filtered views, linked records, formulas. But it is architecturally a web application with a rich offline cache. The authoritative copy remains on Notion's servers. Concurrent edits go through those servers. The long-now property fails immediately: Notion data lives in Notion's proprietary format, on Notion's servers, accessible only through Notion's application. An export produces a ZIP archive of markdown files and CSVs — a representation, not a migration. The relational structure, the filters, the formulas, the comment threads — none export faithfully to a format another application understands. -Both approaches demonstrate a genuine tension. Plain-file formats satisfy the long now but cannot support structured collaboration. Structured databases support collaboration but require a centralized authority. The missing piece is a data model that is both structured and convergent - which is what CRDTs over a typed document store provide. +Both approaches expose a genuine tension. Plain-file formats satisfy the long now but cannot support structured collaboration. 
Structured databases support collaboration but require a centralized authority. The missing piece is a data model that is both structured and convergent — which is what CRDTs over a typed document store provide. -### The Lightweight Replica Apps (Linear ([linear.app](https://linear.app/), the issue tracker), Liveblocks) +### The Lightweight Replica Apps (Linear, Liveblocks) -Each Linear client maintains a local SQLite replica of the user's team data [8]. Writes go to local state first. The sync engine applies them to the local replica immediately and propagates to the server asynchronously. The result is an application that feels instant - no loading spinners, no optimistic-update lag, no visible round trips. The gap is where the replica ends. Linear's local SQLite database is a replica: it reflects a copy of server state, not an authoritative local node. The server remains the source of truth. Linear surfaces the sync state in the UI when the server is unreachable, so writes that depend on server-side validation (status changes on issues, comment submissions, project mutations) are visibly queued rather than silently dropped - but the queue still depends on the relay coming back. More critically, Linear's sync protocol is proprietary. It has no peer-to-peer mode. Two Linear clients on the same local network cannot sync directly with each other when the internet is down. The relay is Linear's infrastructure, and it is not optional. +Each Linear client maintains a local SQLite replica of the user's team data [8]. Writes go to local state first; the sync engine applies them to the local replica immediately and propagates to the server asynchronously. The result is an application that feels instant — no loading spinners, no optimistic-update lag, no visible round trips. -Background jobs - notifications, automations, integrations - run server-side. An automation that moves issues between states when conditions are met does not run on the local node. 
It runs in Linear's cloud. Remove the cloud and the automation stops. The local replica is a performance optimization and a UX improvement. It is not a full node. +The gap is where the replica ends. Linear's local SQLite database is a replica: it reflects a copy of server state, not an authoritative local node. The server remains the source of truth. Linear surfaces the sync state in the UI when the server is unreachable, so writes that depend on server-side validation are visibly queued rather than silently dropped — but the queue still depends on the relay coming back. More critically, Linear's sync protocol is proprietary. It has no peer-to-peer mode. Two Linear clients on the same local network cannot sync directly with each other when the internet is down. The relay is Linear's infrastructure, and it is not optional. -The practical consequence: Linear passes the "no spinners" property and partially passes "the network is optional" for reads. It does not pass network-optional for writes to server-owned records, does not pass peer-to-peer collaboration without Linear's relay, does not pass vendor independence, and does not pass the long now - Linear's data lives in Linear's format, accessible through Linear's API, exportable to CSV only. Liveblocks and similar CRDT-as-a-service frameworks push further in the CRDT direction but relocate the vendor dependency to hosted infrastructure rather than eliminating it. +Background jobs — notifications, automations, integrations — run server-side. An automation that moves issues between states when conditions are met does not run on the local node. It runs in Linear's cloud. Remove the cloud and the automation stops. The local replica is a performance optimization and a UX improvement. It is not a full node. 
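The replica-plus-queue behavior described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Linear's sync engine (its protocol is proprietary): the `LocalReplica` class, the `send` callback, and the record shapes are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, Tuple

Write = Tuple[str, str]  # (record id, new value); hypothetical shape


class LocalReplica:
    """Hypothetical smart-cache replica: local state plus an outbox of unsent writes."""

    def __init__(self, send: Callable[[Write], bool]) -> None:
        self.state: Dict[str, str] = {}
        self.outbox: Deque[Write] = deque()
        self._send = send  # returns True once the server has accepted the write

    def write(self, record_id: str, value: str) -> None:
        # Apply locally first so the UI never waits on a round trip.
        self.state[record_id] = value
        self.outbox.append((record_id, value))

    def flush(self) -> int:
        """Push queued writes in order; stop at the first one the server does not accept."""
        sent = 0
        while self.outbox and self._send(self.outbox[0]):
            self.outbox.popleft()
            sent += 1
        return sent


# Stand-in for the vendor's server: reachable or not, plus what it has accepted.
server = {"up": False, "accepted": []}


def send(write: Write) -> bool:
    if server["up"]:
        server["accepted"].append(write)
    return server["up"]


replica = LocalReplica(send)
replica.write("ISSUE-1", "in progress")
replica.write("ISSUE-2", "done")
pending_while_offline = len(replica.outbox)  # queued, not dropped
sent_while_offline = replica.flush()         # server unreachable: nothing leaves
server["up"] = True
sent_after_reconnect = replica.flush()
```

The queue drains only through `send`, which stands in for the vendor's server. Two such replicas on the same network have no way to exchange outboxes with each other, which is the gap the section describes.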
-Replicache ([replicache.dev](https://replicache.dev/), the sync framework from Rocicorp) is the most direct production competitor in this category and the system most often suggested as an off-the-shelf path to local-first apps. Replicache provides a sync framework rather than a complete application: developers integrate the Replicache client into their app, supply server endpoints that produce mutation diffs, and receive a local-first reactive cache for free [9]. The model is correct for the sync layer it covers - optimistic mutation, conflict-free pull-based reconciliation, sub-second responsiveness from a local IndexedDB cache. The gap is the same as Linear's: the server is the source of truth, the mutators run server-side to validate against authoritative state, and offline writes queue against an eventual reconciliation that the developer's server controls. Replicache solves the latency and reactivity problems extremely well within a smart-cache architecture. It does not produce a full node. The framework is also deliberately scoped to the sync transport - schema migration, key custody, MDM packaging, and the business model are application-developer responsibilities, not framework features. +Liveblocks and similar CRDT-as-a-service frameworks push further in the CRDT direction but relocate the vendor dependency to hosted infrastructure rather than eliminating it. -### The Local-First Finance App (Actual Budget) - -Actual Budget runs entirely offline by default - no account required, no network request during normal operation. All budget data lives in a local SQLite file the user can copy, back up, or open directly. When the network is unavailable, Actual Budget functions identically to when it is available, because its operation does not depend on the network at any point. +Replicache ([replicache.dev](https://replicache.dev/)) is the most direct production competitor in this category. 
It provides a sync framework rather than a complete application: developers integrate the Replicache client into their app, supply server endpoints that produce mutation diffs, and receive a local-first reactive cache [9]. The model is correct for the sync layer it covers — optimistic mutation, conflict-free pull-based reconciliation, sub-second responsiveness from a local IndexedDB cache. The gap is the same as Linear's: the server is the source of truth, the mutators run server-side, and offline writes queue against a reconciliation the developer's server controls. Replicache solves the latency and reactivity problems extremely well within a smart-cache architecture. It does not produce a full node. -This satisfies the first property (no spinners), the third (network optional), and substantially the seventh (ownership and control - the user has a file on their disk). It makes a credible attempt at the fifth (the long now) by virtue of using an open database format that other tools can read. +### The Local-First Finance App (Actual Budget) -Where Actual Budget stops is collaboration and multi-device sync. The application is single-user by design. Two people cannot jointly manage a budget in Actual Budget without manual coordination: exporting the file, sending it, importing it, hoping no concurrent changes need to be merged. The optional sync service Actual Budget offers addresses multi-device access for a single user - the budget file syncs across the user's own devices through a hosted relay. This reintroduces a central server, though the server's role is deliberately minimal: relay and backup, not authority. +Actual Budget runs entirely offline by default — no account required, no network request during normal operation. All budget data lives in a local SQLite file the user can copy, back up, or open directly. When the network is unavailable, Actual Budget functions identically to when it is available. -The team collaboration case does not exist. 
Actual Budget has no concept of roles, permissions, concurrent edits, or conflict resolution between multiple users. Its data model is single-user because its design is single-user. Adapting it to multi-user team workflows would require adding CRDTs, a distributed data model, access control, and a sync protocol - at which point it would no longer be Actual Budget, but a substantially new system. +This satisfies the first property (no spinners), the third (network optional), and substantially the seventh (ownership and control). It makes a credible attempt at the fifth (the long now) by using an open database format that other tools can read. -The lesson from Actual Budget is that full local-first operation for a single user is achievable and commercially viable. The leap to team collaboration without reintroducing a central authority is the hard part that Actual Budget does not attempt. +Where Actual Budget stops is collaboration and multi-device sync. Two people cannot jointly manage a budget without manual coordination: exporting the file, sending it, importing it, hoping no concurrent changes need to be merged. The optional sync service addresses multi-device access for a single user through a hosted relay — which reintroduces a central server, though its role is deliberately minimal: relay and backup, not authority. The team collaboration case does not exist. Actual Budget has no concept of roles, permissions, concurrent edits, or conflict resolution between multiple users. -### The Research Prototypes (Automerge ([github.com/automerge/automerge](https://github.com/automerge/automerge), a JSON-like CRDT library), Ink & Switch Essays) +The lesson from Actual Budget is that full local-first operation for a single user is achievable and commercially viable. The leap to team collaboration without reintroducing a central authority is the hard part Actual Budget does not attempt. 
-Automerge ([github.com/automerge/automerge](https://github.com/automerge/automerge), a JSON-like CRDT library) and the Ink & Switch body of work represent the most theoretically rigorous local-first implementation available [1]. Automerge is a CRDT library. Given any two copies of an Automerge document that diverged during a network partition, merge them and get the same result regardless of merge order. The algorithm is correct. The library is production-quality for its intended use case. Ink & Switch has published detailed essays on collaborative applications built on Automerge - Pushpin, Backchat, Trellis - that demonstrate what local-first collaboration looks like in practice when the data model is right. +### The Research Prototypes (Automerge, Ink & Switch Essays) -The gap between Automerge and a deployable production system is significant and intentional. Automerge is a library that operates on documents. It assumes the existence of a sync transport - something to move operations between peers. Several sync backends exist (the Automerge sync server, AutomergeRepo), and they work correctly. They provide no production deployment model for end-user software: enterprise governance, per-role access control, CP-class record types that require distributed lease coordination, financial correctness guarantees, key management at scale, MDM (Mobile Device Management)-compatible installers, or a business model. +Automerge ([github.com/automerge/automerge](https://github.com/automerge/automerge)) and the Ink & Switch body of work represent the most theoretically rigorous local-first implementation available [1]. Automerge is a CRDT library: given any two copies of an Automerge document that diverged during a network partition, merge them and get the same result regardless of merge order. The algorithm is correct. 
Ink & Switch has published detailed essays on collaborative applications built on Automerge — Pushpin, Backchat, Trellis — that demonstrate what local-first collaboration looks like in practice when the data model is right. -The Ink & Switch essays are explicit about this. Pushpin is a demonstration. Backchat is a prototype. The essays document what is possible and identify what remains to be engineered. They are research artifacts, not shipping products. A developer who picks up Automerge and AutomergeRepo has the correct CRDT primitive and a working sync transport. They have not acquired a production system. They have acquired the foundation for one. +The gap between Automerge and a deployable production system is significant and intentional. Automerge is a library that operates on documents. It assumes the existence of a sync transport — something to move operations between peers. Several sync backends exist and they work correctly. They provide no production deployment model for end-user software: enterprise governance, per-role access control, CP-class record types that require distributed lease coordination, financial correctness guarantees, key management at scale, MDM (Mobile Device Management)-compatible installers, or a business model. -The document-centric nature of Automerge is also a structural constraint. Documents are a natural fit for rich text, drawings, and unstructured collaborative content. A team running a field operation with structured records - work orders, inspection logs, invoices, asset registries - needs typed records with schema migration, not just documents. The CRDT merge semantics generalize across both cases, but the tooling, the query model, and the schema evolution story are different problems that Automerge leaves to application builders. +The Ink & Switch essays are explicit about this. Pushpin is a demonstration. Backchat is a prototype. 
A developer who picks up Automerge and AutomergeRepo has the correct CRDT primitive and a working sync transport — not a production system, but the foundation for one. ```mermaid graph LR @@ -115,9 +113,9 @@ graph LR --- -## What Each Gets Right - and Where It Stops +## What Each Gets Right — and Where It Stops -Each approach takes local-first seriously in one layer and builds on a centralized dependency in another. Obsidian chose plain files for durability and sacrificed structured collaboration. Linear built a local replica for latency and left authority on the server. Replicache built a sync framework and left the rest to the developer's server. Actual Budget delivered full local authority for a single user and stopped short of team sync. Automerge built correct CRDT merge and left the production deployment model to application builders. Each dependency reflects a real problem the approach did not attempt to solve. +Each approach takes local-first seriously in one layer and builds on a centralized dependency in another. Obsidian chose plain files for durability and sacrificed structured collaboration. Linear built a local replica for latency and left authority on the server. Replicache built a sync framework and left the rest to the developer's server. Actual Budget delivered full local authority for a single user and stopped short of team sync. Automerge built correct CRDT merge and left the production deployment model to application builders. 
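The order-independence property attributed to Automerge above can be illustrated with a much simpler CRDT. The sketch below is not Automerge's algorithm. It is a per-field last-writer-wins map, chosen only because it fits in a few lines, and the field names, timestamps, and node ids are hypothetical.

```python
from typing import Dict, Tuple

# Each field maps to (logical timestamp, node id, value).
# Tuple ordering doubles as the tie-break when timestamps are equal.
Entry = Tuple[int, str, str]
Doc = Dict[str, Entry]


def merge(a: Doc, b: Doc) -> Doc:
    """Per-field last-writer-wins: keep the entry with the larger (timestamp, node id)."""
    merged = dict(a)
    for field, entry in b.items():
        if field not in merged or entry > merged[field]:
            merged[field] = entry
    return merged


base: Doc = {"status": (1, "a", "open"), "owner": (1, "a", "pm")}

# Three replicas diverge during a partition.
replica_a: Doc = {**base, "status": (2, "a", "in review")}
replica_b: Doc = {**base, "status": (2, "b", "blocked"), "note": (3, "b", "waiting on bids")}
replica_c: Doc = {**base, "owner": (4, "c", "site manager")}

# Merge in two different orders and groupings.
ab_then_c = merge(merge(replica_a, replica_b), replica_c)
c_then_ba = merge(replica_c, merge(replica_b, replica_a))
```

Both merge orders produce the same document, and merging a replica with itself changes nothing. Note that last-writer-wins silently discards the losing `status` value. A design that surfaces conflicts to the user would need to retain it instead.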
The pattern becomes clearest in a like-for-like comparison across the four axes that determine whether a system meets a serious local-first bar: @@ -132,31 +130,27 @@ The pattern becomes clearest in a like-for-like comparison across the four axes | **Actual Budget** | Fully local + optional self-hosted sync | User-held SQLite | User-device only | Open-source; user runs everything | | **Automerge** | Library + sync transport (developer-supplied) | Whatever the application chooses | Whatever the application chooses | Open-source library | -The table makes the gap visible. Every system that satisfies vendor-independent data ownership stops short of team collaboration; every system that supports team collaboration delegates authority to a vendor. The missing step is not a better sync library, a more sophisticated CRDT, or a more polished local database. It is the composition of all the layers into a complete node - the composition that no system in this table currently delivers. +Every system that satisfies vendor-independent data ownership stops short of team collaboration. Every system that supports team collaboration delegates authority to a vendor. The missing step is not a better sync library, a more sophisticated CRDT, or a more polished local database. It is the composition of all the layers into a complete node — which no system in this table currently delivers. --- ## The Missing Step: Full Node, Not Smart Cache -The question that distinguishes this architecture from the approaches above is this: - -> What if a user's workstation ran a full node of the system - including state, business logic, and sync - such that "the cloud" is merely another peer, not the source of truth? - A smart cache knows what the server knows, slightly earlier. A full node knows what the user's data is. The distinction matters when the server is down, when the vendor goes away, when the network is unreachable, and when the user needs to understand, export, or migrate their data. 
-A full node runs five things locally: the presentation layer, the application logic, the sync daemon, the storage layer, and the security primitives. The cloud, where it appears at all, handles relay and backup - assistance for coordination and disaster recovery, not a source of truth. +A full node runs five things locally: the presentation layer, the application logic, the sync daemon, the storage layer, and the security primitives. The cloud, where it appears at all, handles relay and backup — assistance for coordination and disaster recovery, not a source of truth. -Consider what this changes for the field operation case. A construction superintendent's device running a smart-cache app can read recently synced records while offline. It cannot create a new inspection log against a work order that was not recently synced, because the work order's authoritative state lives on the server and the cache may be stale. It cannot run an automation that escalates an unresolved inspection to the site manager, because automations run server-side. When the sync eventually completes, there may be conflicts between the superintendent's offline writes and changes made by others - conflicts the smart-cache app resolves by whatever heuristic the vendor chose, without surfacing the conflict to the user. +Consider what this changes for the field operation case. A construction superintendent's device running a smart-cache app can read recently synced records while offline. It cannot create a new inspection log against a work order that was not recently synced, because the work order's authoritative state lives on the server and the cache may be stale. It cannot run an automation that escalates an unresolved inspection to the site manager, because automations run server-side. When sync eventually completes, the smart-cache app resolves conflicts using whatever heuristic the vendor chose, without surfacing them to the user. 
-A full node on the same device holds the complete relevant working set: all work orders the user is assigned to, all inspection logs for the current project, all assets in scope. It creates new records against local state and guarantees they will sync when connectivity returns. It runs business logic locally - the automation runs on the node, not on a server. When the sync completes, CRDT merge semantics handle concurrent edits with a defined and predictable strategy, surfacing genuine conflicts as a conflict inbox rather than silently picking a winner. +A full node on the same device holds the complete relevant working set: all work orders the user is assigned to, all inspection logs for the current project, all assets in scope. It creates new records against local state and guarantees they will sync when connectivity returns. It runs business logic locally. When sync completes, CRDT merge semantics handle concurrent edits with a defined and predictable strategy, surfacing genuine conflicts as a conflict inbox rather than silently picking a winner. -The full node does more than the smart cache not because it is smarter, but because it holds more data and carries more execution authority. The smart cache defers to a server it cannot reach. The full node acts on behalf of the user. +The full node does more than the smart cache not because it is smarter, but because it holds more data and carries more execution authority. The smart cache defers to a server it cannot reach; the full node acts on behalf of the user. -The pattern has operational precedent at scale. Modern point-of-sale systems - Square Reader and Toast - operate offline-first on the merchant's own device: a transaction recorded while the network is unreachable settles when connectivity returns, and the merchant's authoritative state advances against the local replica until then. 
Salesforce's Mobile SDK ships an offline-first object framework that field agents use to log work where signal is unreliable; conflict resolution surfaces to the agent rather than failing silently. These products demonstrate user-device-replica operation at commercial scale in domains where the cost of failed offline operation is concrete. What I describe in this dissertation generalizes that pattern beyond payments and field service to structured-data applications more broadly: typed records with evolving schemas, collaborative edits across multiple peers, and enterprise governance that survives procurement review. +The pattern has operational precedent at scale. Square Reader and Toast operate offline-first on the merchant's own device: a transaction recorded while the network is unreachable settles when connectivity returns. Salesforce's Mobile SDK ships an offline-first object framework that field agents use to log work where signal is unreliable; conflict resolution surfaces to the agent rather than failing silently. Both demonstrate user-device-replica operation at commercial scale in domains where failed offline operation has concrete cost. -This reframes what "offline support" means. Offline support in the smart-cache model means "some operations work offline, with degraded functionality." Offline support in the full-node model means "all operations work offline, identically." The distinction is not a feature comparison. It is a structural property that follows from where authority lives. +"Offline support" in the smart-cache model means some operations work offline, with degraded functionality. In the full-node model it means all operations work offline, identically. The distinction is not a feature comparison — it is a structural property that follows from where authority lives. -Every component of this model has a production analogue that validates it separately. 
CRDTs are production-ready: Linear's sync engine and Actual Budget's data model both use CRDT merge semantics in production, and the Automerge library is deployed in commercial collaborative applications - though Automerge users have to budget for known operational costs (document size growth with edit history, cold-sync time on long-lived documents, and garbage-collection cadence) that the library leaves to the application. Figma's multiplayer editor is not a pure CRDT deployment - its engineers describe it as "inspired by multiple separate CRDTs" over a server-authoritative, per-property merge - but it independently validates that per-property conflict resolution works for real-time collaborative editing at scale. Leaderless replication works at scale: Cassandra and DynamoDB rely on it. Desktop shell plus local server is a proven pattern: VS Code language servers and 1Password's local agent use it. Declarative partial sync is solved: PowerSync and ElectricSQL implement it. Silent background container services are normalized: Docker Desktop and Tailscale established the model. None of these components are speculative. My contribution is the *composition* - specifically, three pieces no other published architecture combines: a per-record CAP boundary that lets AP-class records and CP-class records coexist in one system, an MDM (Mobile Device Management)-deployable installer model that lets enterprise IT ship full-node software without bespoke onboarding, and an AGPLv3-with-managed-relay business model that makes the architecture economically viable without forcing vendor data custody. +Every component of this model has a production analogue. CRDTs are production-ready: Linear's sync engine and Actual Budget's data model both use CRDT merge semantics in production. 
The Automerge library is deployed in commercial collaborative applications, though users must budget for known operational costs — document size growth, cold-sync time on long-lived documents, and garbage-collection cadence — that the library leaves to the application. Figma's multiplayer editor independently validates that per-property conflict resolution works at scale. Leaderless replication works at scale: Cassandra and DynamoDB rely on it. Desktop shell plus local server is a proven pattern: VS Code language servers and 1Password's local agent use it. Declarative partial sync is solved: PowerSync and ElectricSQL implement it. Silent background container services are normalized: Docker Desktop and Tailscale established the model. ```mermaid graph TB @@ -178,38 +172,32 @@ graph TB --- -## What This Dissertation Adds - -The seven Kleppmann ideals [1] define the target. They do not tell you how to satisfy all seven simultaneously in a system that also passes enterprise procurement review, deploys via MDM, satisfies the compliance regimes that make local-first a legal requirement and not just a preference, handles key rotation when a team member leaves, migrates schema when nodes run different versions, survives a "couch device" returning after six months offline, and generates revenue that funds ongoing development. +## What This Book Adds -The regulatory pressure is now global, and the laws cluster by region. European regulation centers on the 2020 Schrems II ruling [4], which constrained transfers of EU personal data to US cloud providers without supplemental safeguards - making local-first residency a structural mechanism that addresses the data-transfer leg of GDPR analysis rather than an architectural preference, with national implementation guidance from Germany's BSI and France's CNIL. +The seven Kleppmann ideals [1] define the target. 
They do not tell you how to satisfy all seven simultaneously in a system that also passes enterprise procurement review, deploys via MDM, satisfies the compliance regimes that make local-first a legal requirement, handles key rotation when a team member leaves, migrates schema when nodes run different versions, survives a device returning after six months offline, and generates revenue that funds ongoing development. -The pattern repeats across regions with named regulators in each: India's DPDP Act 2023 [5] and the RBI's payment-data localization circular; the UAE's DIFC DPL 2020 [6]; Russia's Federal Law 242-FZ [7]; China's PIPL (Personal Information Protection Law) 2021; Brazil's LGPD (Lei Geral de Proteção de Dados); South Africa's POPIA (Protection of Personal Information Act); Nigeria's NDPR (Nigeria Data Protection Regulation); Japan's APPI (Act on the Protection of Personal Information); South Korea's PIPA (Personal Information Protection Act); and the GCC's emerging cluster (KSA's PDPL, Bahrain's PDPL). Each, in different language, treats data residency or controlled cross-border transfer as a compliance mechanism. The full coverage matrix across these and ~30+ other frameworks is in Appendix F. In the United States, HIPAA and SOC 2 frame the same structural argument through the healthcare and vendor-audit lenses. In each jurisdiction, an architecture where data lives on the user's own hardware is the architecture that makes compliance tractable. +The regulatory pressure is now global. European regulation centers on the 2020 Schrems II ruling [4], which constrained transfers of EU personal data to US cloud providers without supplemental safeguards — making local-first residency a structural mechanism that addresses the data-transfer leg of GDPR analysis, with national implementation guidance from Germany's BSI and France's CNIL. 
India's DPDP Act 2023 [5], the UAE's DIFC DPL 2020 [6], Russia's Federal Law 242-FZ [7], China's PIPL, Brazil's LGPD, South Africa's POPIA, Nigeria's NDPR, Japan's APPI, South Korea's PIPA, and the GCC's emerging cluster each treat data residency or controlled cross-border transfer as a compliance mechanism. The full coverage matrix is in Appendix F. In the United States, HIPAA and SOC 2 frame the same structural argument through the healthcare and vendor-audit lenses. In each jurisdiction, data on the user's own hardware is the architecture that makes compliance tractable. -The existing implementations - Automerge, Actual Budget, Linear's sync engine, Obsidian's local storage - each solve one part of this problem correctly. CRDTs handle concurrent merge. Local storage handles offline reads. Plain-file formats handle long-term portability. Fast local replicas handle perceived performance. None of them addresses the full set, and none provides the composition. +The existing implementations — Automerge, Actual Budget, Linear's sync engine, Obsidian's local storage — each solve one part of this problem correctly. CRDTs handle concurrent merge. Local storage handles offline reads. Plain-file formats handle long-term portability. Fast local replicas handle perceived performance. None addresses the full set, and none provides the composition. -The seven properties define target state. They do not tell you how to get there - what phases to sequence, what assumptions to validate, what to trade when two properties conflict, what to verify when you claim you are done. This dissertation is the plan that sits under the properties: phases in the five-layer stack and the deployment zones (Chapter 3, Chapter 4), adversarial validation in the council chapters (Part II), verification specification (Part III), and execution playbooks (Part IV). +Three disciplines separate working implementations from prototypes that stall. 
First, integration is where local-first projects die — every component exists in open source; wiring them with consistent invariants, especially CRDT epoch transitions across a Flease-coordinated subset of records, is engineering rather than research. Second, security is feasible only when novel cryptography is not generated: audited primitives (libsodium, age, Argon2id) used opaquely, with the DEK/KEK hierarchy composed against a specification a cryptographic engineer has reviewed. Third, long-term portability has one product-level decision that can kill the architecture alone — invent a wire format and repeat Anytype's Any-Block mistake, or adopt Yjs or Automerge and inherit their portability guarantees. The choice, not the invention, is what makes it feasible. -Three disciplines separate working implementations from prototypes that stall. First, integration is where local-first projects die - every component exists in open source; wiring them with consistent invariants, especially CRDT epoch transitions across a Flease-coordinated subset of records, is engineering rather than research. Second, Property 6 is feasible only when novel cryptography is not generated: audited primitives (libsodium, age, Argon2id reference) are used opaquely, and the DEK (Data Encryption Key)/KEK (Key Encryption Key) hierarchy composes them against a specification a cryptographic engineer has reviewed. Third, Property 5 has one product-level decision that can kill the architecture alone - invent a wire format and repeat Anytype's Any-Block mistake, or adopt Yjs ([github.com/yjs/yjs](https://github.com/yjs/yjs), the JavaScript CRDT library) or Automerge and inherit their portability guarantees. Feasibility is contingent on choosing, not inventing. +The contribution here is the composition. Not new primitives — every component has a production analogue. The CRDT merge semantics come from the Automerge and Yjs lineage. The gossip anti-entropy protocol comes from Cassandra and DynamoDB. 
The desktop shell plus local server pattern comes from VS Code and 1Password. The declarative partial sync model comes from PowerSync and ElectricSQL. The container-as-background-service model comes from Docker Desktop and Tailscale. The bidirectional schema lenses come from Ink & Switch's Cambria work. -My contribution is the composition. Not new primitives - every component in this architecture has a production analogue. The CRDT merge semantics come from the Automerge and Yjs lineage. The gossip anti-entropy protocol comes from Cassandra and DynamoDB. The desktop shell plus local server pattern comes from VS Code and 1Password. The declarative partial sync model comes from PowerSync and ElectricSQL. The container-as-background-service model comes from Docker Desktop and Tailscale. The bidirectional schema lenses come from Ink & Switch's Cambria work. - -What I assemble from those proven components: +What that assembly produces: - A node architecture with a stable microkernel and domain plugins under strict versioned contracts, so the system can evolve without breaking in-field deployments. - A per-record CAP positioning model that treats CRDT-merge records and lease-coordinated records as first-class distinct classes, with a defined boundary and a defined handoff between them. - A three-tier CRDT GC policy that keeps document growth bounded without sacrificing merge correctness for active peers. -- A key hierarchy - root organization key, per-role key encryption keys, per-document data encryption keys - that makes key rotation proportional to document count rather than document size, and makes member removal cryptographically effective rather than contractually promised. +- A key hierarchy — root organization key, per-role key encryption keys, per-document data encryption keys — that makes key rotation proportional to document count, and makes member removal cryptographically effective rather than contractually promised. 
- A schema migration strategy using expand-contract, bidirectional lenses, and epoch coordination that allows nodes running different schema versions to coexist on a live team. - An enterprise deployment model: MDM-compatible installers, SBOM (Software Bill of Materials) generation, code signing and notarization, air-gap operation, incident response runbooks. - A business model: AGPLv3 core, managed relay as the paid service, relay economics that become cash-flow positive before meaningful scale. - A governance model: foundation-backed structure, community contributor path, dual-license CLA for enterprise customers. -The managed relay is a residual vendor dependency the architecture does not eliminate - it disaggregates it. The relay holds ciphertext only. Data custody remains on user hardware, and the relay can be self-hosted without protocol changes. Chapter 3 specifies the relay's trust boundaries; Chapter 11 specifies its governance model. The distinction between SaaS vendor dependency and managed-relay dependency is not rhetorical: the former holds decryptable data; the latter does not. - -The architecture stands on the local-first community's work. The paper that named the seven ideals [1] is the benchmark against which my dissertation's design is measured throughout. The Ink & Switch essays on Automerge, Cambria, and collaborative document design are the intellectual foundation for the CRDT and schema evolution sections. Kleppmann's distributed systems work [2] provides the vocabulary that runs throughout Part III. +The managed relay is a residual vendor dependency the architecture does not eliminate — it disaggregates it. The relay holds ciphertext only. Data custody remains on user hardware, and the relay can be self-hosted without protocol changes. Chapter 3 specifies the relay's trust boundaries; Chapter 11 specifies its governance model. The distinction is not rhetorical: a SaaS vendor holds decryptable data; a managed relay does not. 
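The key-hierarchy bullet in the list above claims that rotation cost tracks document count. Some back-of-envelope arithmetic shows why that matters. Every figure below is a hypothetical assumption for illustration, including the wrapped-key size, which assumes an AEAD-style wrapping of a 256-bit data encryption key.

```python
# Hypothetical team: all figures are illustrative assumptions, not measurements.
DOCUMENTS = 10_000
AVG_DOCUMENT_BYTES = 2 * 1024 * 1024  # assumed 2 MiB average document
WRAPPED_DEK_BYTES = 32 + 16 + 12      # assumed: 256-bit DEK + 128-bit tag + 96-bit nonce


def rewrap_cost(documents: int, wrapped_dek_bytes: int) -> int:
    """Bytes rewritten when a role KEK rotates and only the wrapped DEKs are re-encrypted."""
    return documents * wrapped_dek_bytes


def reencrypt_cost(documents: int, avg_document_bytes: int) -> int:
    """Bytes rewritten if every document body had to be re-encrypted instead."""
    return documents * avg_document_bytes


rewrap = rewrap_cost(DOCUMENTS, WRAPPED_DEK_BYTES)
reencrypt = reencrypt_cost(DOCUMENTS, AVG_DOCUMENT_BYTES)
ratio = reencrypt // rewrap
```

Under these assumptions, rotating a role key rewrites roughly 600 KB of wrapped keys rather than about 21 GB of document bodies. Whether the data encryption keys themselves must also rotate after a member leaves is a separate policy question that this arithmetic does not address.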
-The composition is the contribution. The next chapter shows what the complete stack looks like in a single diagram. Chapter 4 provides the decision framework for determining when this architecture is the right choice and when it is not. +The next chapter shows what the complete stack looks like in a single diagram. Chapter 4 provides the decision framework for when this architecture is the right choice and when it is not. --- diff --git a/vol-1/part-1-thesis-and-pain/ch03-inverted-stack-one-diagram.md b/vol-1/part-1-thesis-and-pain/ch03-inverted-stack-one-diagram.md index 916834c..ce38780 100644 --- a/vol-1/part-1-thesis-and-pain/ch03-inverted-stack-one-diagram.md +++ b/vol-1/part-1-thesis-and-pain/ch03-inverted-stack-one-diagram.md @@ -1,6 +1,6 @@ # Chapter 3 - The Inverted Stack in One Diagram - + @@ -9,16 +9,16 @@ ## The Inversion in One Sentence -Every architectural decision in this dissertation follows from one reversal of priority: +Every architectural decision in this book follows from one reversal of priority: -> **Conventional SaaS (Software as a Service):** Cloud database is primary - local device caches and renders. -> **Local-Node Architecture:** Local node is primary - cloud relay is an optional sync peer. +> **Conventional SaaS:** Cloud database is primary — local device caches and renders. +> **Local-Node Architecture:** Local node is primary — cloud relay is an optional sync peer. -In the conventional model, the local device is a thin client. It renders what the server says to render. It writes what the server accepts. Remove the server and the device has nothing - a shell waiting for instructions that will not arrive. +In the conventional model, the local device is a thin client. It renders what the server says to render. It writes what the server accepts. Remove the server and the device has nothing — a shell waiting for instructions that will not arrive. -In the local-node model, the device *is* the server. 
The local encrypted database holds the authoritative copy of the user's data. When peers are reachable, the node exchanges state with them. When no peers are reachable, the node operates at full fidelity. The node has no degraded mode (with one exception that earns its complexity: CP-class records that require distributed lease coordination - covered later in this chapter). It carries no dependency on any remote service for core function. +In the local-node model, the device *is* the server. The local encrypted database holds the authoritative copy of the user's data. When peers are reachable, the node exchanges state with them. When no peers are reachable, the node operates at full fidelity. The node has no degraded mode — with one exception: CP-class records that require distributed lease coordination, covered later in this chapter. It carries no dependency on any remote service for core function. -The architecture resolves into one mental model that the principal diagram below anchors. Supporting diagrams in this chapter visualize specific layer interactions; the principal diagram is what the reader holds. +The architecture resolves into one mental model anchored by the principal diagram below. ```mermaid graph LR @@ -47,7 +47,7 @@ Primary: Node B")] end ``` -The relay is optional. Two nodes on the same LAN sync directly via mDNS peer discovery, with no relay in the path at all. The relay exists to help nodes find each other across NAT boundaries, not to hold their data. If the relay goes down, nodes fall back to direct peer-to-peer communication on the local network. If that also fails, they work offline and catch up when connectivity returns. +The relay is optional. Two nodes on the same LAN sync directly via mDNS peer discovery, with no relay in the path. The relay exists to help nodes find each other across NAT boundaries, not to hold their data. If the relay goes down, nodes fall back to direct peer-to-peer communication on the local network. 
If that also fails, they work offline and catch up when connectivity returns. This is the inversion. Everything else is implementation. @@ -55,7 +55,7 @@ This is the inversion. Everything else is implementation. ## The Five Layers -The inversion is one sentence. The five-layer model is why that sentence is implementable - the specific form the architecture takes when each property of the SaaS bundle is delivered without vendor data custody. Each layer has a clear owner. Each layer has a clear boundary. Each layer has an answer to the question every distributed system must answer: what happens when the network is unavailable? +The inversion is one sentence. The five-layer model is why that sentence is implementable — the specific form the architecture takes when each property of the SaaS bundle is delivered without vendor data custody. Each layer has a clear owner, a clear boundary, and an answer to the question every distributed system must answer: what happens when the network is unavailable? ```mermaid graph TB @@ -84,30 +84,30 @@ Peer Discovery · NAT Traversal"] ### Layer 1: Presentation -The presentation layer renders what the local store contains. That is its entire job. It owns no state. It caches nothing independently. It makes no decisions about data. +The presentation layer renders what the local store contains. It owns no state, caches nothing independently, and makes no decisions about data. -In the Zone A accelerator (the Anchor pattern - offline-by-default local-first desktop), this layer is a .NET MAUI (.NET Multi-platform App UI) Blazor Hybrid shell - a native application window embedding a Blazor WebView that renders Razor components backed by local data. The component surface is identical to the Zone C accelerator (the comms mesh pattern - hybrid multi-tenant SaaS) browser shell: the same `Harborline.UICore` and `Harborline.UIAdapters.Blazor` components render whether the node is a local desktop installation or a hosted tenant instance. 
This is deliberate. If a UI component only works against a cloud backend, it has not been designed correctly for this architecture. +In the Zone A accelerator (the Anchor pattern), this layer is a .NET MAUI Blazor Hybrid shell: a native application window embedding a Blazor WebView that renders Razor components backed by local data. The component surface is identical to the Zone C accelerator (the comms mesh pattern) browser shell. The same `Harborline.UICore` and `Harborline.UIAdapters.Blazor` components render whether the node is a local desktop installation or a hosted tenant instance. A UI component that only works against a cloud backend has not been designed correctly for this architecture. -The presentation layer's primary local-first responsibility is status indication. Users should always know the state of their data without interrogating it. The `SunfishNodeHealthBar` component (`Harborline.UIAdapters.Blazor`; pre-1.0) surfaces four states: +The presentation layer's primary local-first responsibility is status indication. The `SunfishNodeHealthBar` component (`Harborline.UIAdapters.Blazor`; pre-1.0) surfaces four states: - **Sync-healthy:** The node is connected to at least one peer and has exchanged a recent delta. - **Stale:** The node has not synced within its configured freshness threshold; local data may lag behind changes made by others. -- **Offline:** No peers are reachable. The node is operating on its own authoritative copy. +- **Offline:** No peers are reachable. The node operates on its own authoritative copy. - **Conflict-pending:** One or more records have diverged from a peer version and require resolution. -Each state must be communicated through more than color. The `SunfishNodeHealthBar` sets `SemanticProperties.Description` to a text equivalent for each state - screen readers announce the current sync status without requiring the user to inspect the color indicator. 
State transitions trigger a live region announcement, so an AT user receives the same notification a sighted user receives visually. The full accessibility specification appears in Chapter 20. +Each state communicates through more than color. The component sets `SemanticProperties.Description` to a text equivalent — screen readers announce sync status without requiring the user to inspect the color indicator. State transitions trigger a live region announcement. The full accessibility specification is in Chapter 20. -When the network is unavailable, the presentation layer changes nothing about its behavior. It continues to render from the local store. The status indicator moves from sync-healthy to offline. The user can still create records, navigate, query, and run any domain workflow that does not require distributed lease coordination. They receive no error page. No spinner. No apology. The software works. +When the network is unavailable, the presentation layer changes nothing. It continues to render from the local store. The status indicator moves to offline. The user creates records, navigates, queries, and runs any domain workflow that does not require distributed lease coordination. No error page. No spinner. No apology. ### Layer 2: Application Logic -The application logic layer runs domain business rules. Command handlers receive user intent and translate it into CRDT (Conflict-free Replicated Data Type) operations and domain events. The layer determines what constitutes a valid state transition, enforces invariants, and emits events that both the local store and the sync daemon consume. +The application logic layer runs domain business rules. Command handlers receive user intent and translate it into CRDT (Conflict-free Replicated Data Type) operations and domain events. The layer enforces invariants and emits events that both the local store and the sync daemon consume. -This layer holds no network-aware code. 
It does not know whether the sync daemon is connected to peers. It writes to the local CRDT store unconditionally - the sync daemon propagates those writes when it can, not when consulted before they happen. This is the property that makes full offline operation possible: business logic executes against local state, not against a remote lock or a remote validation service. +This layer holds no network-aware code. It does not know whether the sync daemon is connected to peers. It writes to the local CRDT store unconditionally — the sync daemon propagates those writes when it can. This is the property that makes full offline operation possible: business logic executes against local state, not against a remote lock or validation service. -The one exception is CP-class records - those whose correctness requires distributed coordination, such as resource reservations, financial postings, and scheduled slots where double-booking is worse than unavailability. For these records, the application logic layer consults the sync daemon lease coordinator before writing. If quorum is unreachable, the write blocks and the UI surfaces a clear indicator. This is an explicit design choice. The user sees a constraint, not a mystery failure. +The one exception is CP-class records — those whose correctness requires distributed coordination: resource reservations, financial postings, and scheduled slots where double-booking is worse than unavailability. For these, the application logic layer consults the sync daemon lease coordinator before writing. If quorum is unreachable, the write blocks and the UI surfaces a clear indicator. The user sees a constraint, not a mystery failure. 
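The write path described above can be sketched in a few lines. This is a language-neutral illustration in Python, not the .NET reference implementation; `RecordClass`, `LeaseCoordinator`, `LocalStore`, and `write` are hypothetical names, and the quorum rule assumed here is a simple majority of the configured peer set. For brevity the sketch raises an exception where the text says the write blocks.

```python
from dataclasses import dataclass, field
from enum import Enum


class RecordClass(Enum):
    AP = "ap"  # mergeable records: always writable against local state
    CP = "cp"  # coordinated records: need a quorum-backed lease first


class QuorumUnavailable(Exception):
    """Signals the UI to show an explicit constraint, not a generic error."""


@dataclass
class LeaseCoordinator:
    configured_peers: int  # size of the configured peer set, this node included
    reachable_peers: int   # peers currently answering, this node included

    def has_quorum(self) -> bool:
        # Strict majority of the configured set (assumed rule for this sketch).
        return self.reachable_peers > self.configured_peers // 2


@dataclass
class LocalStore:
    log: list = field(default_factory=list)

    def append(self, record_id: str, payload: dict) -> None:
        self.log.append((record_id, payload))


def write(store: LocalStore, coordinator: LeaseCoordinator,
          record_class: RecordClass, record_id: str, payload: dict) -> None:
    # Only CP-class writes consult the coordinator; AP-class writes never do.
    if record_class is RecordClass.CP and not coordinator.has_quorum():
        raise QuorumUnavailable(record_id)
    store.append(record_id, payload)
```

With one of five peers reachable, an AP-class write succeeds and a CP-class write is refused with a named reason; once three of five are reachable, the same CP-class write goes through.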
-The CAP positioning is per record class, not per application: +CAP positioning is per record class, not per application: | Record Class | CAP Position | Why | |---|---|---| @@ -118,92 +118,90 @@ The CAP positioning is per record class, not per application: ### Layer 3: Sync Daemon -The sync daemon is a separate long-running process. It is not a thread in the application. It is not a hosted service that stops when the application window closes. It registers with the OS service manager and runs continuously from login, communicating with the application shell through a Unix domain socket. When the application restarts after a crash, the sync daemon has already been collecting deltas from peers - the application reconnects to a daemon that has been working the whole time. +The sync daemon is a separate long-running process — not a thread in the application, not a hosted service that stops when the application window closes. It registers with the OS service manager and runs continuously from login, communicating with the application shell through a Unix domain socket. When the application restarts after a crash, the sync daemon has already been collecting deltas from peers. The daemon manages five concerns: -**Peer discovery.** Discovery follows a three-tier hierarchy. On the local network, mDNS provides zero-configuration discovery - two devices on the same Wi-Fi segment find each other automatically when the network permits multicast. (Many enterprise Wi-Fi configurations filter mDNS by default; on those networks, the next tier is the path that actually works.) Across networks, a mesh VPN layer (WireGuard-based) handles NAT traversal without port forwarding. For teams where neither tier is viable, the managed relay provides a final option. +**Peer discovery.** On the local network, mDNS provides zero-configuration discovery — two devices on the same Wi-Fi segment find each other automatically when the network permits multicast. 
Across networks, a mesh VPN layer (WireGuard-based) handles NAT traversal without port forwarding. For teams where neither tier is viable, the managed relay provides a final option. -**Gossip anti-entropy.** Every 30 seconds, the daemon selects two random peers from its membership list and exchanges a delta - the operations each holds that the other lacks. Vector clocks scoped per-document (one entry per peer that has produced operations on that document) track what each peer has seen. This is the same anti-entropy mechanism used by large-scale distributed databases [2]; on a five-person team, it runs across workstations with no infrastructure required. +**Gossip anti-entropy.** Every 30 seconds, the daemon selects two random peers from its membership list and exchanges a delta — the operations each holds that the other lacks. Vector clocks scoped per-document track what each peer has seen. The same anti-entropy mechanism underpins large-scale distributed databases [2]; on a five-person team, it runs across workstations with no infrastructure required. -**Delta streaming.** After the gossip protocol identifies divergence, the daemon streams the missing CRDT operations to each peer. The protocol wire format is CBOR (Concise Binary Object Representation) - compact binary encoding that minimizes bandwidth on the intermittent connections that are the baseline operating condition for hundreds of millions of enterprise workers worldwide, not an edge case. +**Delta streaming.** After the gossip protocol identifies divergence, the daemon streams the missing CRDT operations to each peer. The wire format is CBOR (Concise Binary Object Representation) — compact binary encoding that minimizes bandwidth on intermittent connections. -**Flease lease coordination.** For CP-class records, the daemon participates in distributed lease negotiation. When a node needs to write a resource reservation or financial posting, it broadcasts a lease request. 
The lease is granted when a quorum of reachable peers acknowledges - the safety guarantee being that two competing leases cannot both reach majority quorum on the same configured peer set, so the system never grants two contradictory leases simultaneously. Default lease duration is 30 seconds, derived in Chapter 14 from the Flease algorithm's quorum-acknowledgment window under the reference network model. A node that goes offline releases its lease at expiry - the team is never permanently blocked by one disconnected device. +**Flease lease coordination.** For CP-class records, the daemon participates in distributed lease negotiation. When a node needs to write a resource reservation or financial posting, it broadcasts a lease request. The lease is granted when a quorum of reachable peers acknowledges — the safety guarantee being that two competing leases cannot both reach majority quorum on the same configured peer set. Default lease duration is 30 seconds, derived in Chapter 14 from the Flease algorithm's quorum-acknowledgment window. A node that goes offline releases its lease at expiry; the team is never permanently blocked by one disconnected device. -**Write buffering.** When no peers are reachable, the daemon continues accepting writes from the application logic layer and buffering them to durable local storage. Buffered writes commit to the local event log before acknowledgment. A power interruption between buffering and peer delivery does not lose data. The moment a peer becomes reachable - on the LAN, via VPN, or via the managed relay - the daemon begins working through the buffer. The application never needs to know that writes were queued. +**Write buffering.** When no peers are reachable, the daemon continues accepting writes from the application logic layer and buffering them to durable local storage. Buffered writes commit to the local event log before acknowledgment — a power interruption between buffering and peer delivery does not lose data. 
The moment a peer becomes reachable, the daemon begins working through the buffer. The application never needs to know that writes were queued. ### Layer 4: Storage -Layer 4 is the source of truth for this node. Everything the presentation layer renders, everything the application logic layer reads, comes from here. Nothing here depends on a remote service. +Layer 4 is the source of truth for this node. The presentation layer renders from here. The application logic layer reads from here. Nothing here depends on a remote service. -The primary store is SQLite encrypted with SQLCipher. The encryption key is derived from user credentials using Argon2id and stored in the OS-native keystore - the macOS Keychain, Windows Credential Manager, or equivalent. Physical storage extraction without user credentials yields nothing readable. +The primary store is SQLite encrypted with SQLCipher. The encryption key is derived from user credentials using Argon2id and stored in the OS-native keystore — the macOS Keychain, Windows Credential Manager, or equivalent. Physical storage extraction without user credentials yields ciphertext. Three storage structures coexist: -**The CRDT document store** holds all AP-class data as typed CRDT documents. Map documents hold structured records. List documents hold ordered sequences. Text documents hold rich text. The CRDT library handles merge semantics - the merge function is commutative, associative, and idempotent, so any two diverged copies of a document produce the same merged result regardless of merge order. The Harborline Shipyard reference implementation currently ships YDotNet (a .NET port of Yjs); Loro is the aspirational target when its C# bindings mature. The `ICrdtEngine` abstraction keeps that choice reversible. (See Appendix G for the full glossary of these libraries and their licenses.) +**The CRDT document store** holds all AP-class data as typed CRDT documents. Map documents hold structured records. 
List documents hold ordered sequences. Text documents hold rich text. The merge function is commutative, associative, and idempotent — any two diverged copies produce the same merged result regardless of merge order. The Harborline Shipyard reference implementation ships YDotNet (a .NET port of Yjs); Loro is the aspirational target. The `ICrdtEngine` abstraction keeps that choice reversible. -**The event log** is an append-only sequence of every domain event and CRDT operation the node has ever processed. It never modifies past entries. Current aggregate state derives from replaying this log from the most recent snapshot. This structure provides corruption resistance, point-in-time recovery, and the audit trail that regulated industries require. +**The event log** is an append-only sequence of every domain event and CRDT operation the node has ever processed. Current aggregate state derives from replaying this log from the most recent snapshot. This structure provides corruption resistance, point-in-time recovery, and the audit trail regulated industries require. -**Read-model projections** are materialized views derived from the event log - the tables, indexes, and calculated fields that make queries fast. If a projection becomes corrupted or stale, it is rebuilt from the event log. The event log is the ground truth. Projections are a performance optimization. +**Read-model projections** are materialized views derived from the event log — tables, indexes, and calculated fields that make queries fast. A corrupted or stale projection rebuilds from the event log. Projections are a performance optimization; the event log is the ground truth. ### Layer 5: Relay and Discovery Layer 5 is the only layer that touches infrastructure outside the local node, and it is optional. 
-The relay's job is narrow: receive encrypted CRDT deltas from one peer, fan them out to co-subscribed peers, and provide a rendezvous point for peer discovery in environments where mDNS and mesh VPN do not reach. The relay holds no authoritative data. It stores no decrypted content. It cannot read the payloads it routes - every delta arrives as ciphertext produced by the sender's DEK (Data Encryption Key)/KEK (Key Encryption Key) encryption layer, and the relay has no access to any key. +The relay's job is narrow: receive encrypted CRDT deltas from one peer, fan them out to co-subscribed peers, and provide a rendezvous point for peer discovery in environments where mDNS and mesh VPN do not reach. The relay stores no decrypted content. Every delta arrives as ciphertext produced by the sender's DEK (Data Encryption Key)/KEK (Key Encryption Key) encryption layer; the relay holds no key. -The relay's two default trust levels reflect this: +The relay's two default trust levels: -- **Relay-only (default):** The relay receives and routes ciphertext. It cannot decrypt anything. This is the maximum-privacy configuration that satisfies data sovereignty requirements without exception. -- **Attested hosted peer (opt-in):** An administrator explicitly issues the hosted relay node a role attestation, making it a full peer. This enables the relay to participate in quorum for CP-class lease coordination - useful for teams too small to form quorum from workstations alone. +- **Relay-only (default):** The relay receives and routes ciphertext. It cannot decrypt anything. This is the maximum-privacy configuration and satisfies data sovereignty requirements without exception. +- **Attested hosted peer (opt-in):** An administrator issues the hosted relay node a role attestation, making it a full peer. This enables the relay to participate in quorum for CP-class lease coordination — useful for teams too small to form quorum from workstations alone. 
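A minimal sketch of the relay-only trust level may help make the narrowness concrete. The `Relay` class and its method names are hypothetical, not the actual relay protocol; the point is only that the relay's data model contains subscriptions and opaque byte strings, and nothing that could decrypt them.

```python
from collections import defaultdict


class Relay:
    """Routes opaque ciphertext between co-subscribed peers. Holds no keys."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(set)  # document id -> peer ids
        self._outbox = defaultdict(list)      # peer id -> [(doc id, ciphertext)]

    def subscribe(self, peer_id: str, doc_id: str) -> None:
        self._subscribers[doc_id].add(peer_id)

    def publish(self, sender_id: str, doc_id: str, ciphertext: bytes) -> None:
        # Fan out to every subscriber except the sender; payload is never inspected.
        for peer_id in self._subscribers[doc_id]:
            if peer_id != sender_id:
                self._outbox[peer_id].append((doc_id, ciphertext))

    def drain(self, peer_id: str) -> list:
        # Hand over everything queued for this peer and clear its queue.
        pending, self._outbox[peer_id] = self._outbox[peer_id], []
        return pending
```

An attested hosted peer would need more than this (it would hold a role attestation and take part in lease quorum), which is why that level is opt-in rather than the default.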
-The relay protocol is open and the relay is self-hostable. Any organization that requires full independence from managed relay infrastructure can operate its own relay with no changes to node configuration. +The relay protocol is open and the relay is self-hostable. Organizations that require full independence from managed relay infrastructure can operate their own relay with no changes to node configuration. -A note on what "optional" means in practice. The relay is *architecturally* optional - the protocol does not require it, two nodes on the same LAN sync directly via mDNS, and a small team whose members all work from one office can run indefinitely without any relay at all. The relay is *operationally* mandatory for the modal team in this dissertation's audience: members across symmetric NATs, members on cellular networks, members on different corporate Wi-Fi networks where mDNS is filtered. For those teams, the relay is what lets two members reach each other when neither is on the same LAN. The architecture does not pretend otherwise; the distinction matters because operational planning has to account for relay availability the same way it accounts for any other shared infrastructure component, even when the relay is self-hosted on the team's own VPS. Fleet observability - relay availability, peer reachability, sync health across the fleet - is what the operator monitors; Chapter 21 specifies the fleet observability primitives. - -The relay's failure is not the application's failure. +The relay is architecturally optional — the protocol does not require it, and a small team whose members all work from one office can run indefinitely without one. The relay is operationally required for the modal team this book addresses: members across symmetric NATs, on cellular networks, or on separate corporate Wi-Fi networks where mDNS is filtered. 
Operational planning must account for relay availability the same way it accounts for any other shared infrastructure component, even when the relay is self-hosted. The relay's failure is not the application's failure. --- ## How This Changes Failure Modes -Chapter 1 named seven failure modes. The inversion addresses each of them specifically. There are also failure modes the SaaS model created that may not have been visible as such - they only become legible once you understand what the vendor was holding on your behalf. And there are new failure modes the inverted architecture introduces. All three categories deserve honest treatment. +Chapter 1 named seven failure modes. The inversion addresses each directly. There are also failure modes the SaaS model created that only become legible once you understand what the vendor was holding on your behalf. And the inverted architecture introduces failure modes of its own. All three categories deserve honest treatment. **What the inversion resolves:** -*The Outage and The Dependency Chain.* The local node holds authoritative state on the device. No upstream failure - your vendor's, or the cloud region beneath your vendor - interrupts it. A relay outage is an inconvenience. Nodes on the same LAN continue syncing directly. Cross-network nodes catch up when the relay recovers. A relay outage is not a data event. The construction PM submitting a bid at 4:58 PM does not care whether a cloud region is degraded, because his node does not consult any remote service to function. +*The Outage and The Dependency Chain.* The local node holds authoritative state on the device. No upstream failure — your vendor's, or the cloud region beneath your vendor — interrupts it. A relay outage is an inconvenience. Nodes on the same LAN continue syncing directly. Cross-network nodes catch up when the relay recovers. A relay outage is not a data event. *The Vendor.* Data on vendor infrastructure is at the vendor's business decision's mercy. 
Data on the user's hardware is not. A vendor acquisition, pivot, or shutdown interrupts the sync service. It does not interrupt access to the user's data. -*The Connectivity.* SaaS requires a persistent connection because the cloud database holds the authoritative copy. The local node holds its own authoritative copy. Connectivity enables sync. It is not a prerequisite for function. The operational precedent is African mobile money: M-PESA and MTN MoMo have operated offline-tolerant financial transaction architectures at continental scale for over fifteen years, demonstrating that the pattern works at population scale in the markets that most require it. +*The Connectivity.* SaaS requires a persistent connection because the cloud database holds the authoritative copy. The local node holds its own authoritative copy — connectivity enables sync; it is not a prerequisite for function. The precedent is African mobile money: M-PESA and MTN MoMo have operated offline-tolerant financial transaction architectures at continental scale for over fifteen years. -*The Data.* Vendor-managed data is portable only on vendor terms - export rate limits, proprietary formats, feature-gated access. Data on the local node is accessible to the user at any time, in a standard format, without vendor participation. Chapter 16 specifies the plain-file export path and the non-technical disaster recovery walkthrough. +*The Data.* Vendor-managed data is portable only on vendor terms — export rate limits, proprietary formats, feature-gated access. Data on the local node is accessible to the user at any time, in a standard format, without vendor participation. Chapter 16 specifies the plain-file export path and the non-technical disaster recovery walkthrough. -*The Price.* Pricing leverage depends on switching costs that compound when data and workflows are entangled with vendor infrastructure. The relay - the one remaining billable dependency - is replaceable. 
The data custody that makes price changes coercive is removed from the equation. +*The Price.* Pricing leverage depends on switching costs that compound when data and workflows are entangled with vendor infrastructure. The relay — the one remaining billable dependency — is replaceable. The data custody that makes price changes coercive is gone. -*The Drift.* Silent corruption and silent divergence are the SaaS failure mode the user catches last and trusts the system about most. The architecture I propose makes the convergence-or-divergence question first-class at the data layer rather than implicit in vendor behavior. CRDT merge semantics produce deterministically convergent state across peers - no silent winner-takes-all resolution. AP-class records that genuinely diverge surface in the conflict inbox as a structured choice, not as a quiet overwrite. CP-class records use distributed lease coordination to refuse contradictory writes at the moment they would create the divergence, rather than accepting both and discovering the inconsistency later. The convergence semantics are testable, the divergence cases are observable, and the resolution is auditable. The cost: developers have to model their domain in operations rather than current-state assignments. Chapter 12 specifies the CRDT engine; Chapter 13 specifies the conflict UX. +*The Drift.* Silent corruption and silent divergence are the SaaS failure mode the user catches last and trusts the system about most. CRDT merge semantics produce deterministically convergent state across peers — no silent winner-takes-all resolution. AP-class records that genuinely diverge surface in the conflict inbox as a structured choice, not a quiet overwrite. CP-class records use distributed lease coordination to refuse contradictory writes at the moment they would create divergence. The convergence semantics are testable, divergence cases are observable, and resolution is auditable. 
The cost: developers must model their domain in operations rather than current-state assignments. Chapters 12 and 13 specify the CRDT engine and the conflict UX. -*The Third-Party Veto.* In 2022, Western SaaS vendors suspended service across Russia and CIS (Commonwealth of Independent States) markets under sanctions enforcement. Hundreds of thousands of organizations that had built workflows on those platforms found their operations interrupted - not because their vendors failed them, but because their vendors were directed to stop serving them. A local-node architecture does not eliminate this vector entirely. A relay can be targeted. The software vendor itself can be targeted. But the architecture disaggregates exposure: data on user hardware is not reachable by acting on the relay operator, and the relay can be self-hosted or replaced for the highest-sensitivity deployments. Chapter 11 specifies relay governance. Chapter 15 covers the compliance framework for the customer-directed variant of this failure mode. +*The Third-Party Veto.* In 2022, Western SaaS vendors suspended service across Russia and CIS markets under sanctions enforcement. Organizations that had built workflows on those platforms found their operations interrupted — not because their vendors failed, but because their vendors were directed to stop serving them. A local-node architecture does not eliminate this vector — a relay can be targeted, the software vendor itself can be targeted — but the architecture disaggregates exposure: data on user hardware is not reachable by acting on the relay operator, and the relay can be self-hosted for the highest-sensitivity deployments. Chapters 11 and 15 cover relay governance and the compliance framework. -The regulatory landscape this failure mode operates in is worth naming. 
The dominant European driver is the EU Court of Justice's 2020 Schrems II ruling, which constrained EU organizations from transferring personal data to US cloud providers without adequate supplemental safeguards - the strongest European legal argument for local-first data residency, enforced nationally by Germany's BSI (Bundesamt für Sicherheit in der Informationstechnik) and France's CNIL (Commission nationale de l'informatique et des libertés). India's DPDP Act 2023 and the RBI's payment-data localization circular, China's PIPL (Personal Information Protection Law) 2021, Russia's Federal Law 242-FZ (Russian-citizen personal data on Russian territory since 2015), the UAE's DIFC DPL 2020, Brazil's LGPD, South Africa's POPIA, Nigeria's NDPR, Japan's APPI, South Korea's PIPA, and the GCC's PDPL cluster (KSA, Bahrain) are representative of the parallel pattern across GCC, APAC, African, and Americas markets; the full coverage matrix is in Appendix F. In each jurisdiction, an architecture where data lives on the user's own hardware is the architecture that makes compliance tractable. One nuance worth flagging: when peer nodes reside in different jurisdictions, a direct peer-to-peer sync becomes a cross-border data transfer in legal terms, even when the data is encrypted in transit and never lands on a vendor server. Chapter 15 specifies the compliance framework for that case. +The dominant regulatory driver for data residency is the EU Court of Justice's 2020 Schrems II ruling, which constrained EU organizations from transferring personal data to US cloud providers without adequate supplemental safeguards. India's DPDP Act 2023, China's PIPL 2021, Brazil's LGPD, and analogous frameworks across APAC and GCC markets follow the same structural logic. The full coverage matrix is in Appendix F. 
When peer nodes reside in different jurisdictions, a direct peer-to-peer sync constitutes a cross-border data transfer in legal terms, even when encrypted and never touching a vendor server. Chapter 15 specifies the compliance framework for that case. **What you may not have noticed you were exposed to:** -*The Security Breach.* Every SaaS vendor holds decryptable copies of everything you have stored with them. A breach anywhere in their infrastructure stack - servers, sub-processors, privileged internal access - is a breach of your data, regardless of any action you took or failed to take. This failure mode is invisible until it has already happened. You cannot evaluate a vendor's internal security posture from outside it. In this architecture, the relay holds only ciphertext: it receives post-encryption deltas sealed under per-document DEKs wrapped by role KEKs, with keys that never leave the originating node. A complete breach of the relay infrastructure exposes nothing. There is no decryptable content to exfiltrate. In jurisdictions where cloud-hosted infrastructure is subject to mandatory government access requirements, end-to-end encryption with keys that never leave the originating device addresses a compliance constraint that cloud storage cannot satisfy architecturally. The attack surface moves to the endpoints - which this architecture addresses explicitly rather than hiding. +*The Security Breach.* Every SaaS vendor holds decryptable copies of everything you stored with them. A breach anywhere in their infrastructure stack — servers, sub-processors, privileged internal access — is a breach of your data, regardless of any action you took. In this architecture, the relay holds only ciphertext: post-encryption deltas sealed under per-document DEKs wrapped by role KEKs, with keys that never leave the originating node. A complete breach of the relay infrastructure exposes nothing. 
In jurisdictions where cloud-hosted infrastructure is subject to mandatory government access requirements, end-to-end encryption with keys that never leave the originating device addresses a compliance constraint that cloud storage cannot satisfy architecturally. -Hayoon Kim found out about her vendor's breach at a hotel in Singapore at 6:47 in the morning, sitting on the edge of a bed she had not slept in, reading an article in *Hankyoreh* that named her by name. Hayoon ran a one-person ISMS-P (Information Security Management System - Personal) consultancy out of Gangnam-gu in Seoul. Her practice management SaaS - a Korean-language platform serving a few thousand domestic compliance professionals - had been breached six weeks earlier. The breach was disclosed to customers via an email that landed in her promotions folder. Hayoon never saw it. The article was the disclosure that reached her. Eleven of her clients were named on the dump that surfaced overnight on a Russian-language forum, each report carrying her name on the cover page, each report listing the specific PIPA (Personal Information Protection Act) Article 29 safety-measure controls she had documented during her 2023 audit work. +Hayoon Kim found out about her vendor's breach at 6:47 in the morning at a hotel in Singapore, sitting on the edge of a bed she had not slept in, reading an article in *Hankyoreh* that named her by name. Hayoon ran a one-person ISMS-P consultancy out of Gangnam-gu in Seoul. Her practice management SaaS had been breached six weeks earlier. The vendor disclosed by email; the email landed in her promotions folder. The article was the disclosure that reached her. Eleven of her clients appeared in the overnight dump on a Russian-language forum, each report carrying her name on the cover page, each listing the specific PIPA Article 29 controls she had documented during her 2023 audit work. 
-She spent the next eleven days drafting individual letters to each affected client explaining what had happened, what data was exposed, what they should do. She had spent her career advising other organizations on this exact kind of letter. Writing eleven of them about her own practice was a different exercise. The platform vendor's chief executive sent a personal apology that was identical, paragraph for paragraph, to an apology another vendor's chief executive had sent the year before - Hayoon recognized three of the sentences from a precedent she had cited in a 2022 article she had written for the Korea Internet & Security Agency's quarterly compliance bulletin. +She spent the next eleven days drafting individual letters to each affected client. She had spent her career advising other organizations on exactly this kind of letter. The platform vendor's CEO sent a personal apology identical, paragraph for paragraph, to an apology another vendor's CEO had sent the year before — Hayoon recognized three sentences from a precedent she had cited in a 2022 article for the Korea Internet & Security Agency's quarterly compliance bulletin. -She still keeps her active client documents on a local encrypted drive that no SaaS vendor has access to. The architecture, she will tell anyone who asks, is what she would have wanted before. Nobody ever asks. +She still keeps her active client documents on a local encrypted drive. The architecture, she will tell anyone who asks, is what she would have wanted before. Nobody ever asks. **What the architecture introduces honestly:** -*Endpoint compromise expands the attack surface.* A centralized cloud database is a single high-value target behind enterprise controls. A fleet of workstations is a larger attack surface with heterogeneous security posture. SQLCipher encryption at rest limits the damage from physical device loss - storage extraction without credentials yields ciphertext. 
But a compromised running node, with the user authenticated, holds live key material in memory. The four-layer defense - encryption at rest, field-level encryption for high-sensitivity records, stream-level data minimization at the sync layer, and circuit breaker quarantine for offline writes - reduces the blast radius per compromised endpoint. It does not eliminate endpoint risk. Chapter 7 addresses the threat model and the key hierarchy. +*Endpoint compromise expands the attack surface.* A centralized cloud database is a single high-value target behind enterprise controls. A fleet of workstations is a larger attack surface with heterogeneous security posture. SQLCipher encryption at rest limits the damage from physical device loss. A compromised running node, with the user authenticated, holds live key material in memory. The four-layer defense — encryption at rest, field-level encryption for high-sensitivity records, stream-level data minimization at the sync layer, and circuit breaker quarantine for offline writes — reduces the blast radius per compromised endpoint. It does not eliminate endpoint risk. Chapter 7 addresses the threat model and the key hierarchy. -*Schema migration complexity increases.* In a centralized SaaS deployment, a schema migration runs once against one database. In a local-node architecture, nodes update independently. A twenty-person team may run five schema versions simultaneously. The expand-contract pattern - new fields additive and backward-compatible during a compatibility window, old fields retired once all active nodes have updated - handles incremental change. Bidirectional lenses handle structural transformations. Schema epochs coordinate breaking changes via quorum agreement. The complexity is real and manageable. It is also categorically harder than single-database migration. Chapter 13 specifies every mechanism. 
+*Schema migration complexity increases.* In a centralized SaaS deployment, a schema migration runs once against one database. In a local-node architecture, nodes update independently — a twenty-person team may run five schema versions simultaneously. The expand-contract pattern handles incremental change. Bidirectional lenses handle structural transformations. Schema epochs coordinate breaking changes via quorum agreement. The complexity is real and manageable. It is also categorically harder than single-database migration. Chapter 13 specifies every mechanism. -*CRDT GC debt accumulates.* A CRDT document records every operation in its history. Without garbage collection, a high-churn document grows without bound. The three-tier GC policy - aggressive compaction for stable documents, 90-day retention for active collaboration documents (configurable per deployment; Chapter 6 derives the default), indefinite retention for compliance-classified records bounded in practice by jurisdiction-specific schedules (six years for HIPAA (Health Insurance Portability and Accountability Act), seven for SOX, as configured) - keeps growth bounded. But GC in a peer-to-peer system requires coordination. A peer offline for three months may return with operations that reference a history the active peers have already compacted. The stale peer recovery protocol handles this case. Chapter 6 covers the failure scenarios. CRDT GC is a real operational concern. This architecture addresses it. It does not make it disappear. +*CRDT GC debt accumulates.* A CRDT document records every operation in its history. Without garbage collection, a high-churn document grows without bound. The three-tier GC policy — aggressive compaction for stable documents, 90-day retention for active collaboration documents, indefinite retention for compliance-classified records bounded by jurisdiction-specific schedules — keeps growth bounded. 
A peer offline for three months may return with operations that reference a history the active peers have already compacted. The stale peer recovery protocol handles this case. Chapter 6 covers the failure scenarios. CRDT GC is a real operational concern. The architecture addresses it; it does not make it disappear. Part II is six rounds of adversarial review by people who were looking for exactly these problems. @@ -213,11 +211,11 @@ Part II is six rounds of adversarial review by people who were looking for exact The five-layer model admits two canonical deployment shapes. Both use the same Harborline component surface, the same sync protocol, and the same five-layer architecture. They differ in where the authoritative data location lives. -**Zone A** (the Anchor pattern) is offline-by-default local-first. It targets .NET MAUI Blazor Hybrid - a native application embedding a Blazor WebView, running on Windows and macOS desktops. Data lives in a local SQLite database encrypted with SQLCipher. Device identity is a long-lived Ed25519 keypair generated at first run and stored in the OS keystore. Sync is opt-in. A user who never enables sync has a fully functional local application. A user who enables sync connects to a managed relay or a direct peer via the gossip protocol. Zone A is the right shape for professional service firms, field operations, and any environment where network connectivity is unreliable, regulated, or genuinely unavailable. The Harborline Shipyard `accelerators/anchor/` directory is the reference implementation - pre-1.0, in active development. +**Zone A** (the Anchor pattern) is offline-by-default local-first. It targets .NET MAUI Blazor Hybrid — a native application embedding a Blazor WebView, running on Windows and macOS desktops. Data lives in a local SQLite database encrypted with SQLCipher. Device identity is a long-lived Ed25519 keypair generated at first run and stored in the OS keystore. Sync is opt-in. 
A user who never enables sync has a fully functional local application. Zone A is the right shape for professional service firms, field operations, and any environment where network connectivity is unreliable, regulated, or genuinely unavailable. The Harborline Shipyard `accelerators/anchor/` directory is the reference implementation — pre-1.0, in active development. -**Zone C** (the comms mesh pattern) is hybrid multi-tenant SaaS. It targets .NET Aspire with a Blazor Server shell and handles multiple commercial tenants with per-tenant data-plane isolation. Each tenant gets a dedicated local-node host process and a dedicated SQLCipher database. The hosted node participates in the tenant's gossip scope as a ciphertext-only peer by default - it routes encrypted deltas but cannot read them. Tenants who need the hosted node to participate in quorum for CP-class operations can issue it a role attestation explicitly. Zone C is the right shape for organizations that want the deployment simplicity of a hosted service alongside the data sovereignty guarantees of a local-node architecture. The Harborline Shipyard `accelerators/bridge/` directory is the reference implementation - pre-1.0, in active development. +**Zone C** (the comms mesh pattern) is hybrid multi-tenant SaaS. It targets .NET Aspire with a Blazor Server shell and handles multiple commercial tenants with per-tenant data-plane isolation. Each tenant gets a dedicated local-node host process and a dedicated SQLCipher database. The hosted node participates in the tenant's gossip scope as a ciphertext-only peer by default. Tenants who need the hosted node to participate in quorum for CP-class operations can issue it a role attestation explicitly. Zone C is the right shape for organizations that want deployment simplicity alongside the data sovereignty guarantees of a local-node architecture. The Harborline Shipyard `accelerators/bridge/` directory is the reference implementation — pre-1.0, in active development. 
-Both shapes use `Harborline.Kernel.Sync` and `Harborline.Foundation.LocalFirst` (pre-1.0). Neither shape changes the sync protocol, the CAP positioning model, or the storage architecture. The difference between Zone A and Zone C is not two different systems. It is one system instantiated at two different authoritative data locations. A developer who understands the five layers understands both shapes. The choice between them is a deployment decision. Chapter 4 provides the framework for making it. +The difference between Zone A and Zone C is not two different systems. It is one system instantiated at two different authoritative data locations. A developer who understands the five layers understands both shapes. The choice between them is a deployment decision. Chapter 4 provides the framework for making it. --- @@ -225,11 +223,11 @@ Both shapes use `Harborline.Kernel.Sync` and `Harborline.Foundation.LocalFirst` This architecture shifts three fundamental habits. -**Writes are local first, propagated second.** In conventional SaaS, a write succeeds when the server acknowledges it. In this model, a write succeeds when it lands in the local store. Sync is asynchronous and non-blocking. Command handlers succeed on local durability, not remote confirmation. Every state mutation must be expressed as a CRDT operation that can be merged with concurrent mutations from other nodes - operations rather than current-state assignments. This discipline is the fundamental shift. +**Writes are local first, propagated second.** In conventional SaaS, a write succeeds when the server acknowledges it. In this model, a write succeeds when it lands in the local store. Sync is asynchronous and non-blocking. Every state mutation must be expressed as a CRDT operation that can be merged with concurrent mutations from other nodes — operations rather than current-state assignments. This discipline is the fundamental shift. 
-**Business logic owns its correctness independently of the network.** The application logic layer has no implicit network-call path. Every validation, every invariant, every state machine transition runs against local data. Logic that depends on globally consistent current state belongs in the CP-class record category, coordinated through distributed leases. Logic that treats a network call as a validation shortcut fails when the network is absent - which means it fails in the field. +**Business logic owns its correctness independently of the network.** The application logic layer has no implicit network-call path. Every validation, every invariant, every state machine transition runs against local data. Logic that depends on globally consistent current state belongs in the CP-class record category, coordinated through distributed leases. Logic that treats a network call as a validation shortcut fails when the network is absent. -**Failure modes are explicit.** An AP-class write always succeeds locally. A CP-class write either acquires a lease or surfaces a clear constraint. A sync conflict surfaces in the conflict inbox, not as a silent overwrite. The system's failure modes are designed to be visible. The developer's job is to wire those signals to the UI correctly, not to paper over them. +**Failure modes are explicit.** An AP-class write always succeeds locally. A CP-class write either acquires a lease or surfaces a clear constraint. A sync conflict surfaces in the conflict inbox, not as a silent overwrite. The developer's job is to wire those signals to the UI correctly, not to paper over them. The five layers in one diagram are the picture Part II will adversarially test. Everything that follows is detail. 
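The "operations rather than current-state assignments" habit described above can be made concrete with a small sketch. It is not the Harborline API or the CRDT engine the book specifies in Chapter 12. `AddHours`, `LaborHoursReplica`, and the labor-hours field are hypothetical names chosen for illustration, and the technique shown is a simple operation-log counter. It assumes each operation carries a globally unique id, so merging is a keyed set union.

```python
from dataclasses import dataclass, field
import uuid


@dataclass(frozen=True)
class AddHours:
    """One mutation expressed as an operation, not as a new total."""
    op_id: str    # globally unique, so re-delivering the same op is harmless
    node_id: str
    delta: int


@dataclass
class LaborHoursReplica:
    """Hypothetical per-node replica of a single 'labor hours' field."""
    node_id: str
    ops: dict[str, AddHours] = field(default_factory=dict)

    def add_hours(self, delta: int) -> AddHours:
        # The write succeeds once it is recorded locally; no network call.
        op = AddHours(op_id=str(uuid.uuid4()), node_id=self.node_id, delta=delta)
        self.ops[op.op_id] = op
        return op

    def merge(self, other: "LaborHoursReplica") -> None:
        # Union keyed by op_id: commutative, associative, and idempotent.
        self.ops.update(other.ops)

    @property
    def total(self) -> int:
        # Current state is derived from the operations, never assigned.
        return sum(op.delta for op in self.ops.values())


def demo() -> tuple[int, int]:
    office = LaborHoursReplica("office")
    site = LaborHoursReplica("site")
    office.add_hours(40)
    site.merge(office)      # both replicas now agree on 40

    # Concurrent edits while the two nodes cannot reach each other.
    office.add_hours(5)
    site.add_hours(3)

    # Sync in both directions; repeating a merge changes nothing.
    office.merge(site)
    site.merge(office)
    site.merge(office)
    return office.total, site.total
```

Calling `demo()` returns `(48, 48)`: both concurrent additions survive and both replicas agree. Had each node assigned a new total instead (`45` on one, `43` on the other), any merge rule would have had to discard one of the edits, which is the silent overwrite the chapter warns about.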
diff --git a/vol-1/part-2-council-reads-the-paper/ch09-local-first-practitioner-lens.md b/vol-1/part-2-council-reads-the-paper/ch09-local-first-practitioner-lens.md index 6bb46f8..e4f325d 100644 --- a/vol-1/part-2-council-reads-the-paper/ch09-local-first-practitioner-lens.md +++ b/vol-1/part-2-council-reads-the-paper/ch09-local-first-practitioner-lens.md @@ -1,22 +1,22 @@ # Chapter 9 - The Local-First Practitioner Lens - + --- -Tomás Ferreira held the local-first practitioner seat on Joel's dissertation committee - not as a theorist, but as someone who had already lived through the failure modes Joel's architecture was trying to prevent. +Tomás Ferreira held the local-first practitioner seat on Joel's dissertation committee — not as a theorist, but as someone who had already lived through the failure modes Joel's architecture was trying to prevent. ## Who Is Tomás Ferreira -Tomás Ferreira has shipped code to the Automerge ([github.com/automerge/automerge](https://github.com/automerge/automerge), a JSON-like CRDT (Conflict-free Replicated Data Type) library) repository for three years. Before that, he built a production local-first application for a small legal firm - document collaboration, no server required - and watched the users try to restore their data from a broken laptop. The restore took four hours because the backup was a folder on Dropbox that had not synced correctly. He sat in the room while it ran. He built a second application after that, this time with a proper backup strategy, and watched a different user delete their container by accident and learn what "no backup" actually means. That is the kind of discovery that does not leave you. +Ferreira has shipped code to the Automerge repository for three years. Before that, he built a production local-first application for a small legal firm — document collaboration, no server required — and watched users try to restore data from a broken laptop. 
The restore took four hours because the backup was a Dropbox folder that had not synced correctly. He sat in the room while it ran. He built a second application with a proper backup strategy, then watched a different user delete their container by accident and learn what "no backup" actually means. That is the kind of discovery that does not leave you. -He is not idealistic about local-first software. He is familiar with where it breaks. +He is not idealistic about local-first software. He knows where it breaks. -His lens on any local-first architecture proposal is not whether it upholds the principles. The principles are table stakes. His questions are operational. What happens when the user's only device dies? What happens when both peers are behind carrier-grade NAT and the relay is down? What happens when a user wants to leave and take their data somewhere else? Ferreira has sat across from non-technical users who lived through all three of these scenarios. He knows what their faces look like. +His lens on any local-first architecture proposal is not whether it upholds the principles — those are table stakes. His questions are operational. What happens when the user's only device dies? What happens when both peers are behind carrier-grade NAT and the relay is down? What happens when a user wants to leave and take their data somewhere else? Ferreira has sat across from non-technical users who lived through all three of these scenarios. He knows what their faces look like. -When Ferreira reviewed Joel's dissertation in Round 1, he brought that operational history into the room with him. He commended the places the dissertation got right. He blocked it on the place it got exactly wrong. +When Ferreira reviewed Joel's dissertation in Round 1, he brought that operational history into the room. He commended the places the dissertation got right. He blocked it on the place it got exactly wrong. 
--- @@ -24,57 +24,53 @@ When Ferreira reviewed Joel's dissertation in Round 1, he brought that operation ### What Ferreira Commended -Ferreira does not commend things he does not mean. His Round 1 scorecard opened at 9 out of 10 for CRDT (Conflict-free Replicated Data Type) library selection. Yjs ([github.com/yjs/yjs](https://github.com/yjs/yjs), the JavaScript CRDT library) for JavaScript environments and Loro ([github.com/loro-dev/loro](https://github.com/loro-dev/loro), a Rust-core CRDT library) for Rust-native performance are both correct choices, well-suited to their respective contexts. The three-tier resolution model - when to apply CRDT merge versus user-arbitrated resolution - is the most honest treatment of CRDT applicability he had encountered in any architecture proposal outside published academic literature. +Ferreira's Round 1 scorecard opened at 9 out of 10 for CRDT library selection. Yjs for JavaScript environments and Loro for Rust-native performance are both correct choices, well-suited to their respective contexts. The three-tier resolution model — when to apply CRDT merge versus user-arbitrated resolution — is the most honest treatment of CRDT applicability he had encountered in any architecture proposal outside published academic literature. -The multi-device onboarding flow - install, scan a QR code, sync in the background - addressed the bootstrapping problem that breaks most naive local-first architectures. The usual failure mode is a chicken-and-egg: to join a workspace, the new device needs credentials, but credentials require an existing peer, and a peer requires the network to be available at exactly the right moment. The QR-based attestation bundle transfers everything the new node needs to authenticate and begin gossip in a single out-of-band step. That is the right design. 
+The multi-device onboarding flow — install, scan a QR code, sync in the background — solved the bootstrapping problem that breaks most naive local-first architectures. The usual failure mode is chicken-and-egg: joining a workspace requires credentials, credentials require an existing peer, and a peer requires the network at exactly the right moment. The QR-based attestation bundle transfers everything the new node needs to authenticate and begin gossip in a single out-of-band step. -He also commended the container cold-start solution. One of the places local-first desktop applications fall apart is the delay between launch and ready for data. A Podman container starting from scratch on first open creates a pause that signals to users that something is wrong - that the software is not, in fact, running locally, but is somehow waiting for something remote. The architecture's answer - a persistent background service that keeps the container running, fronted by a health-check gate that holds the UI until the daemon is ready - is the right call. It hides the implementation detail without deceiving the user. Hiding without lying is craft. +He also commended the container cold-start solution. A Podman container starting from scratch on first open creates a pause that signals something is wrong — that the software is not, in fact, running locally, but waiting for something remote. The architecture's answer — a persistent background service that keeps the container running, fronted by a health-check gate that holds the UI until the daemon is ready — hides the implementation detail without deceiving the user. Hiding without lying is craft. -Alignment with the Kleppmann et al. local-first ideals [1] scored an 8 out of 10. The dissertation understood the ideals and implemented most of them faithfully. The Ink and Switch essays on Pushpin and Backchat were not cited, which left it vulnerable to community criticism. 
The local-first community notices when practitioners ignore prior art. That is a condition, not a block. - -Community governance scored a 5 out of 10. An MIT or Apache 2 license is stated. Who controls the roadmap, who approves breaking changes, what the contribution model looks like - none of that is specified. The local-first community has watched too many promising projects fork or stall because governance was not designed before it was needed. That is a condition too. +Alignment with the Kleppmann et al. local-first ideals [1] scored an 8 out of 10. The dissertation understood the ideals and implemented most of them faithfully. The Ink and Switch essays on Pushpin and Backchat were not cited, which left it vulnerable to community criticism. Community governance scored a 5 out of 10: an MIT or Apache 2 license is stated, but who controls the roadmap, who approves breaking changes, and what the contribution model looks like — none of it specified. Both were conditions, not blocks. Then Ferreira got to data portability. ### The Blocking Issue: No Export Path -The paper's thesis is data ownership. The paper's proof is the local node: because the data lives on the user's machine, in a local encrypted database, the user owns it in a structural sense. The vendor cannot take it away. The SaaS (Software as a Service) subscription cannot gate access to it. The architecture enforces the ownership as a design invariant, not a contractual promise. +The paper's thesis is data ownership. The proof is the local node: because data lives on the user's machine in a local encrypted database, the user owns it structurally. The vendor cannot take it away. A SaaS subscription cannot gate access to it. Ferreira agreed with the thesis and blocked on the execution. -If a user wants to leave the application - switch to a different tool, transfer their data to a new platform, or simply preserve their records in a format still readable in twenty years - how do they do it? 
The paper specified the backup target: rclone with a user-controlled object storage account. But rclone backup preserves the internal data format. It does not export the data in a form any other application can read. It gives no JSON file of their records, no CSV of their tabular data, no folder of Markdown documents. It gives them a copy of the encrypted local database, readable only by the application that created it. - -Ferreira named the philosophical contradiction precisely. A paper arguing for data ownership that does not specify how a user exports their data in a durable, application-independent format does not actually deliver data ownership. It delivers data custody under slightly better conditions. Custody is the lesser word. The architecture had to learn the difference. +If a user wants to leave the application — switch tools, transfer data to a new platform, or preserve records in a format still readable in twenty years — how do they do it? The paper specified the backup target: rclone with a user-controlled object storage account. But rclone backup preserves the internal data format. It does not export records in a form any other application can read. It gives no JSON file, no CSV of tabular data, no folder of Markdown documents. It gives a copy of the encrypted local database, readable only by the application that created it. -The difference matters. A user who wants to move from this architecture to a different tool needs access to their data in a form the new tool can ingest. An application-independent export format - JSON for structured records, CSV for tabular data, Markdown for documents - satisfies that requirement. An internal backup format does not. +Ferreira named the philosophical contradiction precisely: a paper arguing for data ownership that does not specify how a user exports their data in a durable, application-independent format does not deliver data ownership. It delivers data custody under slightly better conditions. 
Custody is the lesser word. -He also flagged the non-technical disaster recovery path as a condition rather than a second block. The architecture specified rclone backup to user-controlled object storage - correct - but never walked through what a non-technical user actually does when their laptop dies and they need to restore. Step one: buy a new laptop. Step two: install the application. Step three: what? The architecture knew the answer but never wrote it down. +He also flagged the non-technical disaster recovery path as a condition. The architecture specified rclone backup to user-controlled object storage — correct — but never walked through what a non-technical user does when their laptop dies and they need to restore. Step one: buy a new laptop. Step two: install the application. Step three: what? The architecture knew the answer but never wrote it down. -The symmetric NAT scenario was a third condition. When two peers are both behind carrier-grade NAT and the relay is unavailable, direct communication is impossible. The paper's peer discovery section described mDNS for LAN discovery and relay for WAN - but it did not acknowledge that carrier-grade NAT can defeat both if the relay is down. The failure mode exists. The paper did not document it. +The symmetric NAT scenario was a third condition. When two peers are both behind carrier-grade NAT and the relay is unavailable, direct communication is impossible. The paper described mDNS for LAN discovery and relay for WAN — but did not acknowledge that carrier-grade NAT can defeat both if the relay is down. The failure mode exists. The paper did not document it. ### Round 1 Verdict: PROCEED WITH CONDITIONS -Ferreira's domain average for Round 1 was 7.0 out of 10. He issued PROCEED WITH CONDITIONS - three items, with the absent export path as the heaviest of them. +Ferreira's domain average was 7.0 out of 10. He issued PROCEED WITH CONDITIONS — three items, with the absent export path as the heaviest. 
-His verdict rationale was direct. Clear is kind. The paper cannot argue for data ownership and omit the export button. Until the architecture specifies how a user retrieves their data in a form that does not require the original application to read it, the ownership claim is hollow. The underlying architecture is better than most. The specific gap is the most important one. The condition would block alpha implementation if it were not addressed before the second review - practitioner-strict, not procedurally formal. +His rationale was direct: the paper cannot argue for data ownership and omit the export button. Until the architecture specifies how a user retrieves their data in a form that does not require the original application to read it, the ownership claim is hollow. The condition would block alpha implementation if unaddressed before the second review. -The paper returned to the author with the data portability issue as a blocking item alongside five others: two from Shevchenko (CRDT GC and Flease split-write), two from Kelsey (no customer archetype and no conversion mechanism), and one from Okonkwo (key compromise response). Shevchenko and Kelsey issued formal BLOCK verdicts. Voss and Okonkwo issued PROCEED WITH CONDITIONS while naming prerequisite items. The revision had to address all six blocking items before any member would begin a second review. +The paper returned with Ferreira's data portability issue alongside five others: two from Shevchenko (CRDT GC and Flease split-write), two from Kelsey (no customer archetype and no conversion mechanism), and one from Okonkwo (key compromise response). Shevchenko and Kelsey issued formal BLOCK verdicts. The revision had to address all six before any member would begin a second review. --- ## What Changed Between Rounds -Four months passed between the Round 1 verdict and the Round 2 submission. Joel addressed all six blocking issues in the dissertation's second version. 
Ferreira's three Round 1 conditions and his single blocking issue all received direct treatment. +Four months passed between the Round 1 verdict and the Round 2 submission. Joel addressed all six blocking issues. -The export path is now specified. One command produces a directory with three artifacts: a JSON file containing all user records in application-independent structure, a set of CSV files for every tabular data type, and a folder of Markdown documents for long-form content. No vendor cooperation required. No active subscription required. No internet connection needed. The command runs against the local node, reads from the local encrypted database, and writes to a path the user specifies. Any application that can ingest JSON or CSV can ingest the output. +The export path is now specified. One command produces a directory with three artifacts: a JSON file containing all user records in application-independent structure, CSV files for every tabular data type, and a folder of Markdown documents for long-form content. No vendor cooperation required. No active subscription required. No internet connection needed. Any application that can ingest JSON or CSV can ingest the output. -The non-technical disaster recovery walkthrough now exists step by step. The scenario is a laptop destroyed beyond recovery - hardware failure, theft, fire, or routine power loss during a load-shedding cycle that interrupts an in-progress write. Step one: acquire a new laptop. Step two: install the application, which installs the container runtime and stack silently. Step three: when prompted, the user enters a recovery code generated during initial setup, or scans a QR code from a team member's device. Step four: the application prompts for the BYOC (Bring Your Own Cloud) backup target - the Backblaze bucket, the S3 path, the rclone destination configured during original setup. Step five: full restore runs in the background. 
The user can work immediately on data their role includes in eager sync buckets; remaining records populate as the background sync completes. No technical knowledge required at any step. The recovery code or team-member QR scan substitutes for the original device's attestation bundle. +The non-technical disaster recovery walkthrough now exists step by step. The scenario is a laptop destroyed beyond recovery. Step one: acquire a new laptop. Step two: install the application, which installs the container runtime and stack silently. Step three: enter a recovery code generated during initial setup, or scan a QR code from a team member's device. Step four: enter the BYOC backup target — the Backblaze bucket, the S3 path, the rclone destination configured during original setup. Step five: full restore runs in the background. The user can work immediately on data in eager sync buckets; remaining records populate as background sync completes. No technical knowledge required at any step. -The walkthrough has a shared-device variant for the deployment models common in GCC, Indian BFSI, and African field operations: a single tablet rotated across a team of field workers, where the device belongs to the workspace and the recovery target is the role, not the user's individual identity. Step one is the same - acquire a replacement tablet. Step two installs the application as a managed deployment via the organization's MDM (Mobile Device Management) profile rather than user-initiated install. Step three authenticates the team's role rather than an individual identity; the device picks up the role's attestation bundle from the relay or a peer device on the team. Step four restores the workspace's data from the role-scoped BYOC backup target. Step five resumes operations: the next field worker who picks up the tablet logs in with their team credentials and finds the workspace ready. The recovery targets the role and the workspace, not the device and its sole user. 
This is what disaster recovery looks like for the deployments that need local-first the most. +The walkthrough adds a shared-device variant for deployment models common in GCC, Indian BFSI, and African field operations: a single tablet rotated across a team of field workers, where the device belongs to the workspace and recovery targets the role, not any individual identity. Step three authenticates the team's role rather than a personal identity; the device picks up the role's attestation bundle from the relay or a peer device. Step four restores from the role-scoped BYOC backup target. This is what disaster recovery looks like for the deployments that need local-first the most. -The symmetric NAT failure mode is now documented honestly. When both peers are behind carrier-grade NAT and the relay is down, direct peer-to-peer communication is impossible. The paper does not paper over this. It names the condition: carrier-grade NAT plus relay outage produces local-only mode for both parties. The relay is the resolution - when the relay is up, it handles NAT traversal. For organizations where relay availability is itself a concern, self-hosting a relay instance on a cloud VM with a public IP provides a fallback that eliminates the symmetric NAT problem by giving both peers a fixed reachable endpoint. +The symmetric NAT failure mode is now documented honestly. When both peers are behind carrier-grade NAT and the relay is down, direct peer-to-peer communication is impossible. The paper names the condition rather than papering over it. For organizations where relay availability is itself a concern, self-hosting a relay instance on a cloud VM with a public IP eliminates the symmetric NAT problem by giving both peers a fixed reachable endpoint. -Community governance now has a three-stage model: the author serves as BDFL for version 1, making unilateral architectural decisions with a published rationale for each breaking change. 
A contributor council of five elected members takes over for version 2, with a defined voting process for protocol changes. A foundation structure for version 3, modeled on established open-source infrastructure projects, provides independent governance once the community has grown enough to sustain it. +Community governance now has a three-stage model: the author serves as BDFL for version 1, a contributor council of five elected members takes over for version 2, and a foundation structure for version 3 provides independent governance once the community has grown enough to sustain it. --- @@ -84,19 +80,17 @@ Community governance now has a three-stage model: the author serves as BDFL for Ferreira opened his Round 2 review by applying the Kleppmann et al. checklist [1] directly. He had done this in Round 1 and found gaps. -The seven local-first ideals, checked against the revised architecture: - -**No spinners, no waiting.** The local node holds the authoritative data copy. The UI reads from local storage. There are no round-trips to a remote server for reads. Instant. Holds. +**No spinners, no waiting.** The local node holds the authoritative data copy. The UI reads from local storage. No round-trips to a remote server for reads. Holds. -**Your work is not trapped on one device.** CRDT sync across peers ensures that data written on one device propagates to all authorized peers. The gossip daemon handles the distribution. Confirmed. +**Your work is not trapped on one device.** CRDT sync across peers ensures data written on one device propagates to all authorized peers. The gossip daemon handles distribution. Confirmed. **The network is optional.** The architecture is explicitly offline-first. The node operates at full fidelity without network connectivity. Degraded UX modes apply only to CP-class data requiring freshness guarantees. Settled. -**Seamless collaboration with your colleagues.** CRDT merge handles concurrent edits without coordination. 
The conflict inbox and bulk resolution UX surfaces the edge cases that require human judgment. Checked architecturally - with a practitioner's honesty note: the reference implementation's CRDT backend integration (YDotNet (the .NET CRDT engine port of Yjs via Rust FFI (Foreign Function Interface)) replacing the current stub) is the open work that will let this check mark move from architectural commitment to field-proven behavior. Ferreira has written enough CRDT code to know that the distinction matters. +**Seamless collaboration with your colleagues.** CRDT merge handles concurrent edits without coordination. The conflict inbox and bulk resolution UX surfaces the edge cases that require human judgment. Checked architecturally — with a practitioner's honesty note: the reference implementation's CRDT backend integration is the open work that will let this check mark move from architectural commitment to field-proven behavior. Ferreira has written enough CRDT code to know the distinction matters. -**The long now.** The compliance CRDT tier with no garbage collection preserves the complete operation history. Records in this tier cannot be lost to compaction. Long-term archival formats are addressed. Met. +**The long now.** The compliance CRDT tier with no garbage collection preserves the complete operation history. Records in this tier cannot be lost to compaction. Met. -**Security and privacy by default.** Subscription scoping at the sync daemon layer enforces data minimization at the protocol layer - nodes receive only the data their role authorizes. End-to-end encryption means the relay handles ciphertext only. Checked. +**Security and privacy by default.** Subscription scoping at the sync daemon layer enforces data minimization at the protocol layer — nodes receive only the data their role authorizes. End-to-end encryption means the relay handles ciphertext only. Checked. 
**You retain ultimate ownership and control.** BYOC backup to user-controlled object storage. AGPLv3 license. Self-hostable relay. Plain-file export. The ownership is structural, not contractual, and the export path now makes it operationally real. Checked. @@ -106,80 +100,76 @@ This was the first version of the dissertation where Ferreira could work through With the blocking issue cleared and the checklist passed, Ferreira turned to what the revision had not addressed. -The UX section of the second paper substantially improved on Round 1. The sync status indicator design is correct - three persistent but unobtrusive indicators in the status bar, escalating from silent to informative to persistent-banner as conditions degrade. The conflict resolution UX addresses the most common usability failure in collaborative local-first applications: the overwhelming, undifferentiated list of conflicts that most systems present when two offline nodes reconnect. Grouping by record type and cause, auto-resolving the cases where predefined rules clearly apply, and offering resolve-all-similar for everything else brings the conflict inbox from anxiety-inducing to manageable. +The UX section substantially improved on Round 1. The sync status indicator design is correct — three persistent but unobtrusive indicators in the status bar, escalating from silent to informative to persistent-banner as conditions degrade. The conflict resolution UX addresses the most common usability failure in collaborative local-first applications: the overwhelming, undifferentiated list of conflicts that most systems surface when two offline nodes reconnect. Grouping by record type and cause, auto-resolving the cases where predefined rules clearly apply, and offering resolve-all-similar for everything else brings the conflict inbox from anxiety-inducing to manageable. 
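As a rough illustration of the grouping described above, the sketch below buckets conflicts by record type and cause, auto-resolves the ones a predefined rule covers, and leaves the rest as groups that a single resolve-all-similar action could clear. The `Conflict` fields, the cause labels, and the `AUTO_RULES` table are hypothetical; the paper does not specify them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Conflict:
    record_id: str
    record_type: str   # hypothetical label, e.g. "task" or "invoice"
    cause: str         # hypothetical label, e.g. "concurrent_edit"
    local: str         # value written on this node
    remote: str        # value received from the peer


# Hypothetical predefined rules: (record_type, cause) -> winning side.
AUTO_RULES = {
    ("task", "concurrent_edit"): "remote",
}


def triage(conflicts):
    """Split conflicts into auto-resolved values and groups needing a human."""
    resolved = {}
    inbox = defaultdict(list)
    for c in conflicts:
        key = (c.record_type, c.cause)
        winner = AUTO_RULES.get(key)
        if winner is not None:
            # A rule clearly applies, so the user never sees this conflict.
            resolved[c.record_id] = c.local if winner == "local" else c.remote
        else:
            # No rule: group with similar conflicts for the inbox.
            inbox[key].append(c)
    return resolved, dict(inbox)


def resolve_all_similar(group, side):
    """Apply one human decision ("local" or "remote") to a whole group."""
    if side not in ("local", "remote"):
        raise ValueError("side must be 'local' or 'remote'")
    return {c.record_id: getattr(c, side) for c in group}
```

The point of the shape is that the inbox is keyed by group, so the UI can show one row per (record type, cause) pair instead of one row per conflicting record.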
The gap Ferreira identified as the primary 30-day abandonment risk is the zero-state first-run experience: what a brand-new user sees after installation, with no prior data and no existing peers. -The paper describes multi-device pairing - what happens when a user adds a second device to an existing workspace. It describes team onboarding - what happens when a user joins an existing team. What it does not describe is what a single user, installing the application for the first time, with no prior data and no colleague to scan a QR code from, actually sees. +The paper describes multi-device pairing and team onboarding. It does not describe what a single user, installing for the first time with no prior data and no colleague nearby, actually sees. -This is where most users leave. Not when the sync breaks. Not when a conflict surfaces. At the beginning, when the screen is empty and the application has no obvious first action. A brand-new user who installs the application, opens it, and sees a blank state without clear guidance for the first thirty seconds is a user who closes it. Not because the architecture failed. Because the first-run experience gave them no foothold. Ferreira has watched this happen with promising local-first software repeatedly. The architecture is sound. The opening screen is a dead end. +This is where most users leave — not when sync breaks, not when a conflict surfaces, but at the beginning, when the screen is empty and the application has no obvious first action. A brand-new user who opens the application and sees a blank state without guidance for the first thirty seconds is a user who closes it. Not because the architecture failed. Because the first-run experience gave them no foothold. Ferreira has watched this happen with promising local-first software repeatedly. 
-The paper should specify the zero-state experience explicitly: what the user sees, what action they take, how the application walks them from empty state to first project, first backup configuration, first invite. This is a product question, not an architecture question. But local-first architectures that do not answer it fail before the architecture gets a chance to prove itself. +The paper must specify the zero-state experience: what the user sees, what action they take, how the application walks them from empty state to first project, first backup configuration, first invite. This is a product question, not an architecture question. Local-first architectures that do not answer it fail before the architecture gets a chance to prove itself. ### Backup and Recovery UX -The backup status model in the revised paper is correct. Three states - Protected, Attention, At Risk - with escalating visual treatment: subtle for Protected (users do not need to think about a working backup), amber badge for Attention (something needs configuration), persistent non-blocking banner for At Risk (data is in danger, but the user retains the ability to dismiss with acknowledgment). The dismissal with explicit acknowledgment respects user agency without hiding the problem. +The backup status model in the revised paper is correct. Three states — Protected, Attention, At Risk — with escalating visual treatment: subtle for Protected, amber badge for Attention, persistent non-blocking banner for At Risk. The dismissal with explicit acknowledgment respects user agency without hiding the problem. -Ferreira's remaining gap: the dissertation describes the backup status display. It does not describe the recovery experience with comparable care. +The dissertation describes the backup status display. It does not describe the recovery experience with comparable care. -If a user's only device is destroyed and they initiate a restore from backup, what do they see? 
Does the application walk them through reconnecting to their backup target the same way it walked them through configuring it? Does the restore progress surface as a background sync indicator, using the same three-state model flipped into a restore context? Or does the user face a technical interface - a rclone path, a bucket URL - at exactly the moment when they are already stressed about lost work? +If a user's only device is destroyed and they initiate a restore from backup, what do they see? Does the application walk them through reconnecting to their backup target the same way it walked them through configuring it? Does restore progress surface as a background sync indicator, using the same three-state model flipped into a restore context? Or does the user face a technical interface — a rclone path, a bucket URL — at exactly the moment when they are already stressed about lost work? -The architecture's backup infrastructure is solid. The recovery UX needs the same design attention. *Your backup is protected* earns user trust only if the restore works without calling support. The paper should describe the restore flow step by step, with the same non-technical framing used for the disaster recovery walkthrough added in the revision. +The architecture's backup infrastructure is solid. The recovery UX needs the same design attention. *Your backup is protected* earns user trust only if the restore works without calling support. -### Production Analogues +### Production Analogues and CRDT Selection -The architecture's analogues table in the revised paper cites Figma ([figma.com](https://www.figma.com/), the design tool), Linear ([linear.app](https://linear.app/), the issue tracker), Obsidian, and PowerSync. These are correct references - they demonstrate that each subsystem of the inverted stack has production validation somewhere. 
But Ferreira, with Automerge and Ink & Switch adjacency, brings a practitioner's opinion on the CRDT library choice that the chapter should surface directly. +The architecture's analogues table cites Figma, Linear, Obsidian, and PowerSync — correct references demonstrating that each subsystem of the inverted stack has production validation somewhere. Ferreira, with Automerge and Ink & Switch adjacency, adds a practitioner's verdict on the CRDT library choice: pick Yjs via YDotNet today. Broadest production adoption, battle-tested merge semantics, documented wire format. Automerge 3.0 (2025) is now production-viable; Loro is the aspirational target once C# bindings mature. Zoho's offline-capable suite — hundreds of thousands of paying users in India and the GCC — is the regional analogue Western surveys miss. 1С:Предприятие is the closest CIS commercial analogue, with tens of millions of users on a desktop-software-with-optional-server model. The `ICrdtEngine` abstraction is the single best architectural decision the dissertation makes: it keeps the engine choice reversible while the field continues to evolve. -Pick Yjs via YDotNet today. Ferreira's verdict is unambiguous: broadest production adoption, battle-tested merge semantics, documented wire format. Automerge 3.0 (2025) is now production-viable where it was not before; Loro is the aspirational target once C# bindings mature. Zoho's offline-capable suite serves hundreds of thousands of paying users in India and the GCC and is the regional analogue Western surveys typically miss; 1С:Предприятие is the closest CIS commercial analogue with tens of millions of users on a desktop-software-with-optional-server model. Ferreira called the framework-agnostic `ICrdtEngine` abstraction the single best architectural decision the dissertation makes - it keeps the choice reversible while the field continues to evolve, and Ch12 specifies the engine survey at full depth.
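To make the reversibility argument concrete, here is a minimal sketch of an engine-agnostic seam, written in Python rather than the C# of the reference implementation. The `CrdtEngine` protocol, its method names, and the toy grow-only-set engine are all illustrative assumptions; they are not the actual `ICrdtEngine` surface, which the text does not show.

```python
from typing import Protocol


class CrdtEngine(Protocol):
    """Hypothetical seam the application codes against (not the real ICrdtEngine)."""

    def apply_local(self, item: str) -> None: ...
    def encode_state(self) -> bytes: ...
    def merge(self, remote_state: bytes) -> None: ...
    def snapshot(self) -> frozenset: ...


class GrowOnlySetEngine:
    """Toy engine: a grow-only set, whose merge is plain set union."""

    def __init__(self) -> None:
        self._items: set[str] = set()

    def apply_local(self, item: str) -> None:
        self._items.add(item)

    def encode_state(self) -> bytes:
        # Sorted, newline-joined text; a stand-in for a real wire format.
        # Items containing newlines are out of scope for this toy.
        return "\n".join(sorted(self._items)).encode("utf-8")

    def merge(self, remote_state: bytes) -> None:
        text = remote_state.decode("utf-8")
        self._items.update(line for line in text.split("\n") if line)

    def snapshot(self) -> frozenset:
        return frozenset(self._items)


def sync_pair(a: CrdtEngine, b: CrdtEngine) -> None:
    """Exchange full state both ways; depends only on the protocol."""
    state_a, state_b = a.encode_state(), b.encode_state()
    a.merge(state_b)
    b.merge(state_a)
```

Because `sync_pair` touches only the protocol, swapping the toy engine for a wrapper around a different library would not change application code, which is the property the abstraction is credited with.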
- -The architecture has a structural accessibility advantage no council chapter has yet named explicitly: assistive technology - screen readers, switch controls, voice input - operates against local data and survives connectivity loss without cascading failures. A user running JAWS, NVDA, or VoiceOver on intermittent connectivity in rural Bahia, rural Oaxaca, or a Lagos branch with daily load-shedding does not experience the application timeouts that cloud-dependent assistive technology produces. Sync status surfaces through ARIA live regions (`role="status"`, `aria-live="polite"`); recovery flows respect the same cognitive-accessibility framing the chapter applies to non-technical users. This is the strongest accessibility argument for local-first that any practitioner has put on paper, and it lives in this chapter because Ferreira's lens - *what the user actually experiences* - is the lens that makes accessibility legible as an architectural property rather than an afterthought. +One structural accessibility advantage no council chapter had yet named explicitly: assistive technology — screen readers, switch controls, voice input — operates against local data and survives connectivity loss without cascading failures. A user running JAWS, NVDA, or VoiceOver on intermittent connectivity does not experience the application timeouts that cloud-dependent assistive technology produces. Sync status surfaces through ARIA live regions; recovery flows respect the same cognitive-accessibility framing applied to non-technical users. This is the strongest accessibility argument for local-first that any practitioner has put on paper, and it lives here because Ferreira's operational lens — *what the user actually experiences* — is what makes accessibility legible as an architectural property rather than an afterthought. ### Implementation Drift Risk -Ferreira's final Round 2 observation is the one that will matter most in year two of production: the implementation drift problem. 
+Ferreira's final observation is the one that will matter most in year two of production: implementation drift. -The Kleppmann et al. paper [1] warns about this directly. Local-first architecture erodes under pressure. The erosion does not happen all at once. It happens one reasonable-sounding decision at a time. A team adds a server-side feature flag check. Then a server-side A/B test. Then product analytics to understand which features users use. Then a server-side model for something the local CRDT cannot handle efficiently. Each decision is defensible in isolation. Collectively, they re-centralize the architecture until the local node is a thick client again and the server is load-bearing. +The Kleppmann et al. paper [1] warns about this directly. Local-first architecture erodes under pressure — one reasonable-sounding decision at a time. A team adds a server-side feature flag check, then a server-side A/B test, then product analytics, then a server-side model for something the local CRDT cannot handle efficiently. Each decision is defensible in isolation. Collectively, they re-centralize the architecture until the local node is a thick client again and the server is load-bearing. -The paper addresses this for business logic - it explicitly prohibits feature gating via server-side checks - but leaves the analytics and observability layer unaddressed. +The paper addresses this for business logic — it explicitly prohibits feature gating via server-side checks — but leaves the analytics and observability layer unaddressed. -Modern product teams expect product analytics. They need to understand where users drop off, which features get used, what errors occur. In a local-first architecture, these signals cannot be collected server-side because there is no server-side session. The options are: opt-in telemetry that users explicitly enable; aggregate statistics piped through the relay, privacy-preserving and metadata-only; or no analytics at all. 
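The first of these options can be sketched in a few lines, assuming telemetry is off unless the user enables it and that only aggregate counts ever leave the recorder. The `Telemetry` class and its method names are hypothetical and say nothing about how the reference implementation is actually structured.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Telemetry:
    # Off by default: nothing is recorded until the user opts in.
    enabled: bool = False
    _counts: Counter = field(default_factory=Counter)

    def record(self, event: str) -> None:
        """Count an event name; silently ignored while telemetry is disabled."""
        if self.enabled:
            self._counts[event] += 1

    def flush(self) -> dict:
        """Return aggregate counts only (no payloads, no identifiers), then reset."""
        report = dict(self._counts)
        self._counts.clear()
        return report
```

Keeping the opt-in check inside `record` rather than at each call site means a later feature cannot collect data simply by forgetting to check the flag.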
+Modern product teams expect product analytics. In a local-first architecture, those signals cannot be collected server-side because there is no server-side session. The options are: opt-in telemetry that users explicitly enable; aggregate statistics piped through the relay, privacy-preserving and metadata-only; or no analytics at all. -The paper must specify which model it adopts and why. The choice is not architecturally complex. Leaving it unspecified guarantees that the first product manager who wants a funnel report will add a server-side analytics endpoint as a quick addition - the first stone on the re-centralization path. The reference implementation adopts opt-in telemetry, disabled by default, with aggregate-through-relay privacy-preserving statistics as the only permitted centralized data collection - mapped to General Data Protection Regulation (GDPR) Article 25 privacy-by-design and consent as the lawful basis. Naming the choice is the governance control that makes the line durable; an ADR (Architecture Decision Record) documenting the decision makes it defensible under pressure from future product analytics requests. +The paper must specify which model it adopts. Leaving it unspecified guarantees that the first product manager who wants a funnel report will add a server-side analytics endpoint — the first stone on the re-centralization path. The reference implementation adopts opt-in telemetry disabled by default, with aggregate-through-relay statistics as the only permitted centralized collection, mapped to GDPR Article 25 privacy-by-design. An ADR documenting the decision makes it defensible under future pressure. ### Round 2 Verdict: PROCEED -Ferreira issued PROCEED in Round 2. No conditions required. No blocking issues. +Ferreira issued PROCEED in Round 2. No conditions. No blocking issues. 
-His four observations - the zero-state first-run gap, the recovery UX, the Actual Budget omission, and the telemetry model - carried specific recommendations, not conditions. He filed them as non-blocking guidance, not as gates on implementation. +His four observations — the zero-state first-run gap, the recovery UX, the backup UX parity gap, and the telemetry model — carried specific recommendations, not conditions. He filed them as non-blocking guidance, not as gates on implementation. -This matters for two reasons. Practically: the architecture proceeds to alpha implementation without resolving these items. Structurally: Ferreira is the first council member, across both rounds, to issue an unconditional PROCEED. The enterprise architect issued PROCEED WITH CONDITIONS. The distributed systems researcher issued PROCEED WITH CONDITIONS. The security practitioner issued PROCEED WITH CONDITIONS. The product manager issued PROCEED WITH CONDITIONS. Ferreira, the practitioner who knows where the bodies are buried, looked at the revised architecture and found nothing that blocked it. +Practically: the architecture proceeds to alpha implementation without resolving these items. Structurally: Ferreira is the first council member across both rounds to issue an unconditional PROCEED. The enterprise architect issued PROCEED WITH CONDITIONS. The distributed systems researcher issued PROCEED WITH CONDITIONS. The security practitioner issued PROCEED WITH CONDITIONS. The product manager issued PROCEED WITH CONDITIONS. Ferreira, the practitioner who knows where the bodies are buried, looked at the revised architecture and found nothing that blocked it. That verdict is not a formality. It is the hardest one to earn. --- -## Global Deployment Context - Ferreira's Empirical Note +## Global Deployment Context -Ferreira's unconditional PROCEED is defensible as an architectural verdict. It is also calibrated against empirical evidence the other council members had to assert. 
In 2022, Adobe, Autodesk, Microsoft, Figma, and dozens of other Western SaaS vendors suspended service across Russia and CIS (Commonwealth of Independent States) markets under sanctions enforcement - hundreds of thousands of organizations lost access with days of notice. That event is the practitioner's evidence for why unconditional PROCEED is not a generosity. An architecture that survives vendor suspension is not a theoretical improvement. It is the architecture that already proved necessary once. +Ferreira's unconditional PROCEED is calibrated against empirical evidence the other council members had to assert. In 2022, Adobe, Autodesk, Microsoft, Figma, and dozens of other Western SaaS vendors suspended service across Russia and CIS markets under sanctions enforcement — hundreds of thousands of organizations lost access with days of notice. An architecture that survives vendor suspension is not a theoretical improvement. It is the architecture that already proved necessary once. -The local-first-is-legally-required regulatory envelope extends well past the General Data Protection Regulation (GDPR). The 2020 Schrems II ruling - *Data Protection Commissioner v. Facebook Ireland Limited*, CJEU Case C-311/18, a Court of Justice ruling distinct from GDPR as regulation - constrains transfers of EU personal data to US cloud providers without adequate supplemental safeguards, enforced nationally by Germany's BSI and France's CNIL. The EU's NIS2 Directive (Article 21 risk-management measures, in force October 2024) and the EU Cyber Resilience Act add cybersecurity-by-design obligations that local-first deployments satisfy structurally. 
India's DPDP Act 2023 and RBI 2018 BFSI data localization circular make local-first compliance for financial data; UAE's DIFC Data Protection Law 2020 and ADGM Data Protection Regulations 2021 may legally prohibit foreign cloud storage for free-zone-licensed firms; Saudi Arabia's PDPL 2021 and parallel Kuwait/Qatar/Bahrain regimes extend the GCC envelope. East Asia: Japan's APPI (2022 revision), South Korea's PIPA + ISMS-P, China's PIPL and MLPS 2.0. Africa: Nigeria's NDPR (re-enacted 2023), South Africa's POPIA, Kenya's Data Protection Act 2019. Latin America: Brazil's LGPD, Mexico's LFPDPPP, Colombia's Ley 1581. CIS: Russia's Federal Law 242-FZ (predating GDPR by two years), Kazakhstan and Belarus parallel localization regimes, plus import substitution (импортозамещение) policy that makes local-first architecture a natural compliance target for public sector and critical infrastructure deployments. The full coverage matrix sits in Appendix F. In each jurisdiction, the architecture where data lives on the user's own hardware is the architecture that makes compliance tractable. +The regulatory envelope extends well past GDPR. The 2020 Schrems II ruling constrains transfers of EU personal data to US cloud providers without adequate supplemental safeguards. India's DPDP Act 2023 and RBI 2018 BFSI localization circular make local-first compliance mandatory for financial data. UAE's DIFC and ADGM data protection rules may legally prohibit foreign cloud storage for free-zone-licensed firms. Saudi Arabia's PDPL 2021, Japan's APPI (2022 revision), South Korea's PIPA, China's PIPL, Brazil's LGPD, and Russia's Federal Law 242-FZ each impose localization or processing constraints that local-first deployments satisfy structurally. The full coverage matrix sits in Appendix F. In each jurisdiction, the architecture where data lives on the user's own hardware is the architecture that makes compliance tractable. 
-Intermittent connectivity is the operational baseline for hundreds of millions of enterprise workers across Sub-Saharan Africa, South and Southeast Asia, and rural Latin America - not a carrier-grade NAT edge case. The disaster recovery walkthrough must therefore address shared-device deployments, which are the norm in African and South Asian enterprise field operations - a single tablet rotated across a team of field workers, where recovery targets the role and the workspace, not the device and its sole user. BYOC backup to role-scoped workspace targets answers this scenario; CRDT subscription scoping at the sync daemon answers the intermittent-connectivity baseline; and local key management - where keys never leave the user's device - answers the state-mandated compelled-access threat model that CIS deployment contexts face as a first-order consideration. The architecture answers these conditions structurally, not contractually. The practitioner's unconditional PROCEED is easier to issue because the markets where the architecture is most needed are also the markets where alternatives have already failed. +Intermittent connectivity is the operational baseline for hundreds of millions of enterprise workers across Sub-Saharan Africa, South and Southeast Asia, and rural Latin America — not a carrier-grade NAT edge case. CRDT subscription scoping at the sync daemon handles the intermittent-connectivity baseline. Local key management answers the state-mandated compelled-access threat model that CIS deployment contexts face as a first-order consideration. The architecture answers these conditions structurally, not contractually. The practitioner's unconditional PROCEED is easier to issue because the markets where the architecture is most needed are also the markets where alternatives have already failed. 
--- ## The Non-Negotiable Practitioner Checklist -What a practitioner carries forward from Ferreira's review: - -- **Export path is a first-class shipping requirement, not a future feature.** JSON, CSV, or Markdown - durable, application-independent formats. The export button is the proof that the ownership claim is real, not contractual. +- **Export path is a first-class shipping requirement, not a future feature.** JSON, CSV, or Markdown — durable, application-independent formats. The export button is the proof that the ownership claim is real. - **Disaster recovery walkthrough ships with the product.** A non-technical user, after complete device failure, must restore from backup in under thirty minutes. Specify for single-device and shared-device deployments. A backup that cannot be restored is not a backup. -- **Telemetry model is decided before the first product-analytics request.** Opt-in telemetry, aggregate-through-relay privacy-preserving statistics, or no analytics at all. Naming the model is the control that prevents implementation drift toward server-side re-centralization. -- **Zero-state first-run experience is specified as a product requirement.** What a new user sees at the blank screen, the first action the application guides them to, the path from empty state to first project, first backup, first invite. This is where most local-first products lose users. -- **Recovery UX receives the same design attention as backup status.** Non-technical restore flow, progress indication, no rclone paths at the moment the user is already stressed about lost work. -- **CRDT engine choice is kept reversible behind a stable abstraction.** `ICrdtEngine` or equivalent. The field is still evolving; Yjs today, Automerge or Loro tomorrow, without rewriting the application layer. +- **Telemetry model is decided before the first product-analytics request.** Opt-in telemetry, aggregate-through-relay privacy-preserving statistics, or no analytics at all. 
Naming the model prevents implementation drift toward server-side re-centralization. +- **Zero-state first-run experience is specified as a product requirement.** What a new user sees at the blank screen, the first action the application guides them to, the path from empty state to first project, first backup, first invite. +- **Recovery UX receives the same design attention as backup status.** Non-technical restore flow, progress indication, no rclone paths at the moment the user is already stressed. +- **CRDT engine choice is kept reversible behind a stable abstraction.** `ICrdtEngine` or equivalent. Yjs today, Automerge or Loro tomorrow, without rewriting the application layer. - **Honesty about offline-only failure modes is non-negotiable.** Symmetric NAT plus relay outage is one example; extended partition beyond the GC horizon is another. Name them, document the fallback, resist the temptation to pretend they cannot occur. -- **Global deployment context is part of the product specification.** Load-shedding durability, shared-device recovery, non-GDPR regulatory envelopes, intermittent-connectivity as operational baseline - these are product requirements for the markets where local-first is most valuable, not features for a future release. +- **Global deployment context is part of the product specification.** Load-shedding durability, shared-device recovery, non-GDPR regulatory envelopes, intermittent connectivity as operational baseline — these are product requirements for the markets where local-first is most valuable. --- @@ -187,17 +177,15 @@ What a practitioner carries forward from Ferreira's review: Ferreira's Round 1 block reduced to a single principle: you cannot claim to give users ownership of their data if you do not give them a way to take it somewhere else. -The local node solves the access problem. Data on the user's machine is accessible when vendor servers are down. 
Data in the user's encrypted local database cannot be held hostage by a subscription paywall. The architecture eliminates the dependency on vendor infrastructure for the thing users care about most: getting to their own work. - -But access is not portability. A user who wants to move to a different application, or preserve their data in a format that does not require this specific software to read, needs more than local storage. They need an export. JSON, CSV, Markdown - durable, application-independent formats that any competent software can ingest. +The local node solves the access problem. Data on the user's machine is accessible when vendor servers are down; data in the local encrypted database cannot be held hostage by a subscription paywall. But access is not portability. A user who wants to move to a different application needs an export — JSON, CSV, Markdown — in formats that any competent software can ingest. The export button is not a nice-to-have. It is the proof of the claim. -The same logic extends to disaster recovery. *Your data is backed up* is not sufficient. *Your data can be restored by a non-technical user in under thirty minutes after a complete device failure* is the claim that actually serves users. The architecture must describe the recovery path with the same care it describes the backup configuration. A backup that cannot be restored is not a backup. It is a simulation of safety. +The same logic extends to disaster recovery. *Your data is backed up* is not sufficient. *Your data can be restored by a non-technical user in under thirty minutes after complete device failure* is the claim that actually serves users. A backup that cannot be restored is not a backup. It is a simulation of safety. -The symmetric NAT failure mode is a concrete example of the honesty standard that separates production local-first software from demos. Every architecture has connectivity scenarios it cannot handle. The question is whether it names them or hides them. 
Carrier-grade NAT plus relay outage produces a failure mode where two peers cannot communicate. Claiming the relay is so reliable this scenario never occurs is wrong. Document the failure mode and describe the fallback: a self-hosted relay on a machine with a public IP removes the symmetric NAT problem entirely, at the cost of the infrastructure burden the managed relay was designed to eliminate. +Honesty about failure modes is what separates production local-first software from a persuasive demo. Every architecture has connectivity scenarios it cannot handle. Carrier-grade NAT plus relay outage produces a failure mode where two peers cannot communicate — and documenting it, with the self-hosted relay as the fallback, is the work that earns the architecture production credibility. -Honesty about failure modes is what distinguishes production local-first software from a persuasive demo. Ferreira has shipped production local-first software. He recognized the difference. He PROCEED'd when he saw it. +Ferreira has shipped production local-first software. He recognized the difference. He PROCEED'd when he saw it. ---
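
The export requirement named above (JSON, CSV, Markdown) can be sketched in a few lines. This is a minimal illustration, not the book's implementation: the `TaskRecord` shape and the function names `toJson`, `toCsv`, and `toMarkdown` are hypothetical, chosen only to show that each format is plain text that other software can read without this application installed.

```typescript
// Hypothetical record shape; the field names are illustrative assumptions.
interface TaskRecord {
  id: string;
  title: string;
  done: boolean;
}

// JSON: lossless and machine-readable; round-trips through JSON.parse.
function toJson(tasks: TaskRecord[]): string {
  return JSON.stringify(tasks, null, 2);
}

// Quote a CSV field only when it contains a comma, quote, or line break,
// doubling any embedded quotes (the convention described in RFC 4180).
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// CSV: one header row, then one row per record; opens in any spreadsheet.
function toCsv(tasks: TaskRecord[]): string {
  const header = "id,title,done";
  const rows = tasks.map((t) =>
    [t.id, t.title, String(t.done)].map(csvField).join(",")
  );
  return [header, ...rows].join("\n");
}

// Markdown: a human-readable checklist. This sketch assumes single-line titles.
function toMarkdown(tasks: TaskRecord[]): string {
  return tasks.map((t) => `- [${t.done ? "x" : " "}] ${t.title}`).join("\n");
}
```

None of the three functions touches a network or a vendor API. Under these assumptions the export runs entirely against local data, which is what lets the export button serve as evidence of ownership rather than a promise of it.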