Recent Articles

Mon May 10 2021 09:21:51 GMT+0000 (Coordinated Universal Time)

Firestore's Technical Advantages

I have read some pretty poor articles bashing Firestore recently. Generally they completely miss the feature set, or cargo-cult Postgres. This article attempts to highlight the features of Firestore that you won't see in a Postgres solution (note: I love Postgres), covering several areas where Firestore is the world's #1.

Clientside first

It's designed for a direct connection to a mobile/webapp. This means it has a number of features that are unmatched in the market.

Latency compensation

Firestore maintains a local cache, so local writes are observable immediately, greatly simplifying controller design. It even broadcasts writes to adjacent tabs in the browser for zero latency across browser tabs. You ain't got time to implement that!
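To make the behaviour concrete, here is a toy model of latency compensation (illustrative only, NOT the real SDK): listeners fire synchronously on a local write, flagged as pending, and fire again once the server acknowledges.

```javascript
// Toy sketch of latency compensation. A real SDK also persists the cache
// and reconciles conflicts; this only shows the observation ordering.
class LatencyCompensatedDoc {
  constructor() {
    this.listeners = [];
    this.data = undefined;
    this.pending = false;
  }
  onSnapshot(listener) {
    this.listeners.push(listener);
  }
  notify() {
    for (const l of this.listeners)
      l({ data: this.data, hasPendingWrites: this.pending });
  }
  async set(data, sendToServer) {
    this.data = data;
    this.pending = true;
    this.notify();            // fires immediately: zero perceived latency
    await sendToServer(data); // the network round trip happens later
    this.pending = false;
    this.notify();            // fires again once the write is acknowledged
  }
}
```

A listener therefore sees the write twice: first with `hasPendingWrites: true`, then with `false`, which mirrors how snapshot metadata distinguishes local from acknowledged state.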

Offline persistence

The cache is backed by persistent storage, so your app works offline without much work. This is a huge feature that is difficult to get right and essential for a good mobile experience.

Authorisation rules

The database has a layer of security rules which are very flexible and can depend on values in the database.
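For illustration, here is a minimal rules sketch (collection and field names are hypothetical) where write access depends on both auth state and a value stored in the document itself:

```
rules_version = '2';
service cloud.firestore {
  match /databases/{database}/documents {
    match /posts/{postId} {
      // anyone can read; only the signed-in owner recorded on the doc can write
      allow read: if true;
      allow write: if request.auth != null
                   && request.auth.uid == resource.data.ownerId;
    }
  }
}
```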

Causal Consistency

Client SDKs observe their own writes first, and remote writes sometime later. Firestore guarantees that remote writes preserve their order. This is better than eventual consistency: it's causal consistency, the best you can manage in a distributed setting. The fact that write order is preserved makes the system very intuitive, but many Firestore competitors do not guarantee this property, which leads to weird bugs and second-guessing.

1M concurrent clients

It's not so easy to support 1M concurrent connections; that's serious engineering work.

Spanner Backed

Firestore somewhat outclasses Postgres on underlying database technology too, being based on Google Spanner. Firestore is the most affordable way to access a Spanner-based database.

99.999% SLA

Yes. You probably can't find a more reliable cross region database.

Multi-region yet strong consistency

Writes are replicated across multiple regions. This is one of the reasons why it is so reliable, it is resistant to single data centre losses. It can achieve this AND still be strongly consistent. It is simply not possible to configure Postgres to be multi region and be strongly consistent. This is really what Spanner brings to the table.


Firestore can do atomic writes across documents, without caveats, without sharding. Very few distributed databases can achieve this in a multi-region setting.

array-contains-any joins

I have read that NoSQL databases do not support joins at all. This is true for many NoSQL solutions, but not the full truth in Firestore's case. It is true that query expressivity is lower than SQL's.

However, thanks to the "array-contains-any" query, you can retrieve a set of documents matching a set of ids in a single query. This is far more efficient than having to retrieve documents on the other side of a join one at a time. Thus a SQL query with 3 joins can usually be performed with 3 queries in Firestore, given the appropriate indexes. Though, to be fair, Postgres has the upper hand here.
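A sketch of the batching this implies (helper names are mine, and `runQuery` stands in for the actual SDK call): the "in"/"array-contains-any" operators accept at most 10 values per query, so a join over n ids becomes ceil(n/10) queries rather than n individual document reads.

```javascript
// Split a list of ids into batches of <=10, the per-query operator limit.
function chunk(ids, size = 10) {
  const out = [];
  for (let i = 0; i < ids.length; i += size) out.push(ids.slice(i, i + size));
  return out;
}

// runQuery(batch) is a stand-in for a real query such as
// collection.where("id", "in", batch).get(); batches run in parallel.
async function fetchByIds(runQuery, ids) {
  const batches = await Promise.all(chunk(ids).map(runQuery));
  return batches.flat();
}
```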


Firestore is true serverless with essentially unbounded scalability thanks to its Spanner backend. It also scales to zero so you only pay for what you use, unlike Postgres which has a fixed provisioning cost and an associated performance ceiling.


Postgres is a great default choice for a startup. However, if your product is used on mobile or across the globe, you might find Firestore a better match due to its state-of-the-art backend and client SDKs.

Disclaimer: I used to work on the Firebase databases

Sun May 02 2021 18:55:48 GMT+0000 (Coordinated Universal Time)

How Cloud Run changes Cloud Architecture

Cloud Run is interesting: it's a general-purpose elastic container hosting service, a bit like Fargate or Azure Container Instances, but with a few critical differences.

Most interesting is that it scales to zero, and auto-scales horizontally, making it very cost-effective for low traffic jobs (e.g. overnight batch processes).

It also runs arbitrary docker containers and can serve requests concurrently, meaning that for modest traffic you don't usually need more than 1 instance running (and you save money).

Its flexibility comes at the cost of higher cold starts though. Take a look at our cold start latencies for an on-demand puppeteer service in a low traffic region:


We are seeing cold start latencies of around 10 seconds to boot up a 400MB container and start Chrome. This was annoyingly slow.

Not all our regions were that slow though, in one of the busier regions we saw a bimodal latency graph:


suggesting that 2.5 seconds is booting up a puppeteer instance and serving the request, and 5-7 seconds is booting the container. In busier regions a container is often already running, which is why the cold latencies are sometimes much lower. (For completeness, a warm latency measurement is 1.5 seconds, so of the 2.5 seconds, probably 1 second is booting Chrome and 1.5 seconds is serving the request.)

So... how could we speed things up? 5-7 seconds is spent on container startup. It's our biggest spender of the latency budget, so that's what we should concentrate on reducing.

One solution is to run a dedicated VM, though that loses the horizontal elasticity. Even so, let's do the numbers.

A 2 vCPU 2GB RAM machine (e2-highcpu-2) is $36.11 per month

Now Cloud Run has a relatively new feature called min-instances.

This keeps some containers IDLE but with no CPU budget, so they can be flipped on quicker. IDLE instances are still charged, BUT at around a 10x reduced cost. The cost for an IDLE 2 vCPU 2GB RAM Cloud Run instance is $26.28 per month.
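A rough sanity check of that number (the per-second idle prices are assumptions taken from the Cloud Run pricing page at the time of writing):

```javascript
// Assumed idle min-instance rates; check the current pricing page.
const IDLE_CPU_PER_VCPU_SECOND = 0.0000025; // $ per vCPU-second
const IDLE_MEM_PER_GIB_SECOND = 0.0000025;  // $ per GiB-second
const SECONDS_PER_MONTH = 730 * 3600;       // ~2,628,000 (730 hour month)

// Monthly cost of keeping one idle instance of the given shape warm.
function idleMonthlyCost(vcpus, gib) {
  return (
    (vcpus * IDLE_CPU_PER_VCPU_SECOND + gib * IDLE_MEM_PER_GIB_SECOND) *
    SECONDS_PER_MONTH
  );
}
```

With these rates, `idleMonthlyCost(2, 2)` comes out at $26.28, matching the figure above.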

This gets pretty close to having your cake and eating it. You get lower latency like a dedicated machine, but also still horizontally elastic. It may even cost less.

For our application, we tried a min-instance of 1 and this was the result.


Our cold start latencies from container startup are decimated! We have not had to change any code.

I think this min-instances feature is a game-changer for cloud architecture. You can now get the latency of a dedicated VM at a comparable price, but with elasticity and image-based deployments. The new min-instances feature broadens the range of applications that serverless compute can address.

Our latency monitoring infrastructure and data is public.

Mon Apr 12 2021 15:20:45 GMT+0000 (Coordinated Universal Time)

Don't aggregate your metrics

Recently Visnu Pitiyanuvath of Observable presented how dataviz techniques can be applied to developer dashboards to improve insights. I followed his advice and it had a transformational effect on my work.

(full talk is below but you don't need to watch it right now)

His talk emphasized that we often over-aggregate metrics [proof]. Indeed, most monitoring dashboards are time series of the mean, perhaps some percentiles, but a load of line graphs nonetheless.

(image: typical Grafana dashboards are not good for building insight)

His message was to stop doing that, and just draw every single event, no aggregation into trendlines. Allow your visual system to notice patterns that would otherwise be invisible under aggregation.

I was quite intrigued. The advice felt like the opposite of what the Google SRE book suggests, where SREs are encouraged to distill the system down to a small number of actionable graphs and precise SLO boundaries. (FWIW: I think the difference is SREs expect to be piloting a well-understood system whereas Visnu is advising how you can start to understand a poorly understood system).

Anyway, I was building a new system that had lots of weird performance quirks, so it was a great testbed to try out the techniques he suggested.

Serverless Cells

The system I am developing executes Observable notebooks in a serverless environment. It's multi-tenant Puppeteer on Cloud Run. It potentially has multiple levels of cold starts, and I was interested in understanding the performance characteristics better.

So I hacked together a latency prober notebook, which satisfyingly could utilize the serverless environment to implement, schedule, and execute the probes, as well as visualize the results.

Data collection

The core latency measurement work is a cell named timedRequest that wraps an HTTP call with a timer. Multiple calls to timedRequest are made in the serverless-cell called ping. In particular, for each region, every 30 minutes, two calls are made in quick succession, with the results saved to the Firebase Realtime Database. The first is tagged the "cold" call, and the second is tagged "warm".

    async function ping(region) {
      const coldTime = readableTimestamp();
      const datapointCold = await timedRequest(region);
      await baseRef.child(coldTime + region).set({
        tag: "cold",
        time: coldTime,
        ...datapointCold, // latency fields measured by timedRequest
      });
      const warmTime = readableTimestamp();
      const datapointWarm = await timedRequest(region);
      await baseRef.child(warmTime + region).set({
        tag: "warm",
        time: warmTime,
        ...datapointWarm, // latency fields measured by timedRequest
      });
    }

Scheduling the work every 30 mins is achieved with a cron. I let the cron kick off latency work executed in europe-west1, which probes 3 different regions: europe-west4, us-central1 and asia-east1.

One would hope the latency between europe-west1 and europe-west4 would be the lowest, but of course our stack is complicated and might route traffic to the US, so that was the motivation for the chosen regions.


Visnu suggested using Vega-Lite (JS API) and Observable together as a good pairing for quick dashboards. Wow, he was 100% right about how easy it was to draw and refine a decent graph containing all the measurement points, read (in realtime) from the Realtime Database.

The following code (source) produces the main workhorse of the dashboard. Every latency measurement is plotted, X is time, Y is latency. And each point is color-coded by the response code and whether it was warm or cold measurement.

The visualization includes the ability to hover over individual data points to get the raw data written in a tooltip, which makes pinpointing the precise time incredibly easy before heading to the system logs.

Here is the full code for building the main dashboard:

    // Reconstructed sketch; the surviving fragments are the mark, the
    // calculate transform, the log scale and the axis title.
    vl.markCircle({ size: 100, tooltip: { content: "data" } })
      .data(measurements)
      .transform(vl.calculate("datum.tag + '/' + datum.status").as("type/status"))
      .encode(
        vl.x().fieldT("time"),
        vl.y().fieldQ("latency").scale({ type: 'log' }).title("latency (ms)"),
        vl.color().fieldN("type/status"))
      .render()

Immediate results - Thundering Herd Issues

Within 24 hours I had observed my first performance glitch. A cluster of measurements with a constellation of response codes occurring at nearly the same time.


It turned out the system had a bug in its pool mechanism. The intent was that requests to the same notebook would reuse the underlying puppeteer instance. Unfortunately, the pool was not populated until AFTER the instance started. So if n requests came in, the system would exhaust its resources booting up n puppeteer instances and generally go bananas. It was a one-line fix.


Chasing down the long tail

After fixing the first set of obvious problems, the focus turned to issues in the long tail. We had several unexpectedly long performance measurements. Requests taking longer than 5 minutes, for a service with a timeout of 20 seconds! What the hell?

In the following graph, you can see on the far left a period where some requests took an impossibly long time. The graph is segmented into intervals where we tried out a different fix.


We first noted that the latencies were beyond the timeout setting for Cloud Run. Cloud Run severs the client connection at the timeout and freezes the container, but the express handler continues to run whenever the container is unfrozen by unrelated subsequent requests.

The fix was to detect and end the long-running processes explicitly (see commit).
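The shape of that fix can be sketched as racing the handler's work against a hard deadline, so nothing outlives the request timeout and gets frozen mid-flight (a generic sketch, not the actual commit):

```javascript
// Resolve with the work's result, or reject once the deadline passes.
// The timer is always cleared so it cannot keep the process alive.
function withDeadline(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("deadline exceeded")), ms);
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Usage sketch inside a handler: withDeadline(executeNotebook(req), 20000)
```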

However, we then saw a reduction in the number of warm latency measurements. Now that the 20s timeout was truly respected, the latency prober ran out of time to gather the 2nd datapoint. So the 2nd adjustment was to bump the deadline to 60 seconds.

After that fix, it seemed to work at first, but latency measurements crept up over time. This turned out to be a problem with the measurement system, not the infra. Each latency prober boots up the latency measurement notebook, which queries the measurement history.

So, the problem is that the Realtime Database client will pause when large quantities of data arrive. If that happens in the middle of a latency measurement, that measurement is stretched in proportion to the amount of data in the system. This is why it was trending upwards over time. It's also why some data points are unaffected while many are affected by a similar magnitude, even though they are readings for different regions!

Figuring out that last bug was quite tricky. I ended up running the serverless env locally and step debugging. At one point I paused during a latency measurement for several minutes, causing a massive spike in latency! But that was just me debugging.

So after adding some logic so that the cron job never loads the dashboard pointlessly:

if (!getContext().serverless) ...

We were finally back where we were but with no crazy long tails! We were often recording warm latencies below 1s!


Open questions remain. We are seeing more 429s from us-central1 despite it being our least loaded region. Latency is also lower in us-central1 when it should be lowest in europe-west4, indicating our traffic is being pointlessly routed to the US somewhere (the Firebase Hosting origin is US-only).

Visnu was right

The main point of this dashboard was to see if plotting individual measurements was better than aggregated trend lines. My conclusion: absolutely.

Many of the trickier issues were only diagnosed because we could see strange unexpected correlations across groups. Or very precise synchronizations in time. Or problems that affected only a small number of data points.

Aggregation would have washed out those details! I am a convert. Never aggregate!

It's also cool that you can host an end-to-end latency prober system in a single notebook; if you want to build your own, you can fork mine.

Mon Apr 12 2021 12:49:45 GMT+0000 (Coordinated Universal Time)

Simple Article Template


Sun Jan 24 2021 19:49:49 GMT+0000 (Coordinated Universal Time)

Making the Trash Joyful, Marie Kondo Style

We believe in Marie Kondo's decluttering maxim that every household object should spark joy. We recently turned our most disliked object, the trash can, into something the kids fight over who gets to empty. How? Here is the story…

To make being at home more pleasant, we pondered which object in the house brings us the least joy. Worse even than the toilet brush, we find the trash can torturous. We hate it to the point of avoiding it 'til trash is spilling onto the floor.

Furthermore, the bags often burst, getting gross liquid everywhere. Yuck, the trash sucks! How could we possibly turn it into something fun? We googled around for top of the range trash cans and found a wonderful concept...

Here is a can that:

  1. Opens itself, so you never need to touch it.
  2. Bags itself when full.
  3. Replaces the bag fully automatically!
  4. Seals the outgoing full bags.

So this is a new robotic product, which I am deeply suspicious of (I have a PhD in robots), but we bought it 4 months ago, it's still going strong, and it genuinely brings us joy. We love showing guests the rebagging cycle. It’s a showpiece and a point of pride now! Wow!

The kids love activating the rebagging cycle, and will happily take the bagged trash to the front door. The can is small, the bags are airtight sealed and don’t leak.

There are negatives, the bags are small. Though, it makes them easy to take out and stops them from breaking, so we do not mind the more frequent trips. Overall it's a really great purchase that has improved our lives. Marie Kondo is right!

I did not think we could make the trash fun and joyful but it is possible. It’s kinda expensive but it’s well worth it. The product we bought was a Xiaomi Townew T1 Self-Sealing and Self-Changing Trash Can (commercial link)

Mon Jan 11 2021 12:33:12 GMT+0000 (Coordinated Universal Time)

The Internet is a Market for Lemons

Do you get anxious when installing/authorizing software? That feeling keeps you to ‘the beaten path’ which exacerbates inequality and amplifies monopolistic power. Here I explain the underlying economic mechanisms that have turned the modern internet into a battlefield...

Internet software distribution has a huge design flaw: you cannot see what you are buying. You cannot see how your data is processed. You cannot verify the software does what it says it does. In many cases it does not.

For example, your internet service provider, that you pay money to provide you with internet, probably also sells your surfing data, in a free market, to anybody. Nobody would willingly choose that package!

Similarly, apps in the app store advertise doing one thing, whilst hoovering up data to sell to data brokers. If given a choice, you would pick the app that does not do that, but you can’t. You cannot observe how software is going to behave post purchase.

The software market is stuffed with software that advertises doing one thing, but behind your back, also does several other things that are against your best interests. This is why we are anxious. The internet, on aggregate, is actively hostile. Why has this happened?

George Akerlof won a Nobel prize for observing that when providers are more informed than the buyers a “market-for-lemons” forms. The market malfunctions by rewarding sleazy and deceptive practices. This is where we are today on the internet.

When buyers cannot assess the quality of a product directly, they use different buying signals. For software, this is often brand reputation. Established companies are incentivised to play by the rules, as their protection of the brand itself becomes worthwhile.

But this encourages winner-takes-all market dynamics. So we end up with just a handful of household brand technology companies (the FAANGs) whose individual opinions dominate the global narrative. This is not a healthy market for diversity.

We need to amplify the smaller good guys whilst avoiding the bad guys. Remember: this whole mess is because software buyers cannot assess the quality of the software they are purchasing. My investment hypothesis is if we fix this, we fix the market.

So my goal is to improve software service observability. Imagine if end users could view the source code in a continuously publicly auditable system. It would take one motivated technical user to inoculate all the non-technical users against hostile service providers.

Serverside open source. We have a demo; check it out on the Observable platform. Follow the journey on Twitter. Let's de-escalate the end user/service provider relationship.

Fri Dec 11 2020 22:41:18 GMT+0000 (Coordinated Universal Time)

Netlify Deployment Manager Notebook

To recap I am building my blog using Observable as the Content Management System (CMS) interface to a statically deployed site (Jamstack). The main motivation for building yet-another-static-site-generator is that Observable is a web-based literate programming notebook environment. So the unique angle of this jamstack is that the content interface is programmable and forkable (like this) which gives me unlimited creative freedom and extension points as a technical content author.

Even the deployment toolchain is hosted as a notebook that can be forked and customized. This article describes some of the features so far for the deployment notebook.

Netlify Deployment Manager

So I just got partial deployment working nicely so I thought now would be a good time to summarize the deployment features so far.

Some of my frustrations with existing CMSs are

  1. Content changes either take a long time to propagate, or the overall page is slow, depending on the cache settings.

  2. Deployment can take a long time.

Instant CDN cache preloading and invalidation

Netlify solves the cache problems with a smart cache. Caches are not cold because the content is actively pushed to the CDN on deploy, and, the old CDN state is invalidated on deploy. So some hard problems are solved just by using Netlify. Thus the website is super fast without the drawback of stale caches.

Faster Deployment with Delta Sync

The other issue is that static sites tend to be slow to deploy due to an O(n) deployment complexity. Again, thanks to Netlify functionality we can send just the content that changes in deployment. Furthermore, thanks to the CMS data model we can model the dependencies across pages so we only need to regenerate the pages that change too.

Netlify offers a deployment API, so we can deploy content directly from a notebook (see the deployStaticFile call).

Tag-based dependencies

File record metadata is stored in Firestore which plays well with Observable. Each record includes a tags array. When an article is updated, we do a reverse query for pages that depend on file tags using the "array-contains-any" query operator. Examples of content that do this are the index.html and the rss.xml against any files tagged "article". When an article is deployed, the page indexes are deployed too.

Parallel Materialization

To improve deploy speed, each notebook contains a serverside cell used to render the page preview. The process of deployment is the materialization of the preview link into Netlify. As the data exchange is a URL, we are pretty flexible about how content is expressed. The content doesn't even need to be hosted on Observable; for instance, the URL could be external (e.g. for materializing 3rd party libraries).

The other useful thing about using a URL as the content representation, and using serverside cells to generate the content, is that we can parallelize materialization just by reading the links in parallel.

The most awesome thing about building on Observable is that this deployment toolchain is hosted within Observable notebooks too. The Netlify Deployment Manager contains all the Oauth code and API calls used to implement the deployStaticFile library cell. You can see how it works and change it whenever you want!

Next steps

The next job is to fix the authentication so it's easier for other Observable users to fork my blog examples and deploy their content on their Netlify accounts. We have not reached a usable service yet but it is getting closer!

-- Tom2

Tue Dec 08 2020 20:29:35 GMT+0000 (Coordinated Universal Time)

RSS Feed added

An RSS feed is an XML file describing what new articles have appeared in a blog. They used to be popular for notifying readers of new content, but that use-case has dwindled in recent years. However, they are still very useful for notifying other computers of changes, enabling a blog to become the hub for personal media automation.

I have now added an RSS feed to the site (here). The RSS feed, like the other pages of the site, is served statically. When a new article is written, the RSS.xml needs to be updated too. This requires new technology for the Observable jamstack.

I drew inspiration from Fred Wilson's blog. He writes a ton but the site is quite minimal. He organizes articles by tags, allowing topics to have dedicated lists while allowing a single article to be a member of many lists. Article tags seem enough to build an RSS feed if we can search over articles using them.

Also, to display an RSS item we need a title and description and a few other metadata fields. So on top of tags, support for arbitrary fields was added. The Observable netlify-deploy library now allows previously deployed static files (the atom of static site deploys) to be queried by tags.
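Generating the feed XML from those queried records can be sketched as a pure function (the field names are assumptions about the record shape, not the exact schema):

```javascript
// Build a minimal RSS 2.0 document from article records.
// Each record is assumed to carry title, path, published and description.
function rssFeed(site, articles) {
  const items = articles
    .map(
      a => `<item>
  <title>${a.title}</title>
  <link>${site}${a.path}</link>
  <pubDate>${new Date(a.published).toUTCString()}</pubDate>
  <description>${a.description}</description>
</item>`
    )
    .join("\n");
  return `<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel><title>Blog</title><link>${site}</link>
${items}
</channel></rss>`;
}
```

The real deployment reacts to the tag query and pushes this output as a static file; a production version would also need XML escaping of the fields.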

So the content to deploy the (RSS.xml) is reactively updated based on the result of a realtime article query. I have granted anonymous read access to the backing Firestore for my blog so those realtime queries can be viewed by anybody.

Tag query support is possible with Firestore indexes using the "array-contains" query semantic. Firestore continues to work very well as the backing store for the Observable jamstack CMS thanks to its realtime and web-based operation.

Tue Dec 08 2020 19:29:10 GMT+0000 (Coordinated Universal Time)

Static site generation in Observable

This post was authored in Observable at @tomlarkworthy/blog-first-post. I love programming in Observable. I have always felt limited by the expressivity of CMSs like WordPress and Contentful. I want to blog using code. I want to use Observable as an interface to a static site.

Write with Code

With Observable I can generate static prose programmatically:

                            ##  #                            
                           ## ####                           
                          ##  #   #                          
                         ## #### ###                         
                        ##  #    #  #                        
                       ## ####  ######                       
                      ##  #   ###     #                      
                     ## #### ##  #   ###                     
                    ##  #    # #### ##  #                    
                   ## ####  ## #    # ####                   
                  ##  #   ###  ##  ## #   #                  
                 ## #### ##  ### ###  ## ###                 
                ##  #    # ###   #  ###  #  #                
               ## ####  ## #  # #####  #######               
              ##  #   ###  #### #    ###      #              
             ## #### ##  ###    ##  ##  #    ###             
            ##  #    # ###  #  ## ### ####  ##  #            
           ## ####  ## #  ######  #   #   ### ####           
          ##  #   ###  ####     #### ### ##   #   #          

And this is generated and embedded into a pure HTML site.

Animate with Code

I can also embed Observable cells for dynamic content (kudos Lionel Radisson). Find more of his great code here

So now I have a kick-ass static site that's super easy to update! I don't need to run a CLI command or do a PR to update it. All features can be done in the browser, including the build chain; the whole thing is entirely in Observable. Furthermore, it's all backed by a CDN and is super fast. There are no compromises on the output, exactly because it's self-authored.

Tech Used

I used a serverside cell called preview to dynamically serve the page. You can see that preview at the following link:

By default, the preview page renders on every visit. This is somewhat slow, taking around 2-3 seconds, but it means published changes are reflected quickly. However, it is a horrible URL and too slow for production.

I give the page a nice URL using Netlify. To make the production page fast, I max out the shared cache settings in the serverside cell when a production X-Version header is present. Thus we lean heavily on the integrated CDN.

On the Netlify end, I set up the page to redirect to the serverside cell URL and add a custom X-Version header. When the production page is updated, the version header is bumped, so the upstream cache is invalidated.
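A minimal netlify.toml sketch of that redirect (the target URL and version value are illustrative, not my real config):

```toml
# Proxy the pretty URL to the serverside cell, stamping a version header.
# Bumping X-Version on deploy invalidates the upstream CDN cache.
[[redirects]]
  from = "/*"
  to = "https://example-serverside-cell.test/:splat"  # hypothetical cell URL
  status = 200
  headers = { X-Version = "42" }
```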

Stay tuned

The personal webpage is a work in progress. Meta tags are missing, the RSS feed doesn't work, and it doesn't support more than one page yet! But I will add to this over the next few weeks and hopefully get it to a state where anybody can create a page easily. For now, follow along on the Observable RSS feed or Twitter.


Tue Dec 08 2020 19:28:50 GMT+0000 (Coordinated Universal Time)

A Zero Install Forkable Jamstack

This blog doesn't require tools to be installed.

It's trivial to write and update content from any computer.

  • Everything required to write content or customize the deployment engine is web hosted.
  • Content is written in Observable notebooks (e.g. this post, an earlier one or the navbars).
  • The deployment toolchain is also hosted in an Observable Notebook (e.g. Netlify deploy).
  • Observable is designed for literate programming. Markdown or HTML or roll your own DSL.

This blog is fast and does not require Javascript

The usual Jamstack advantages apply.

  • Compiled to static assets deployed to a CDN.
  • Exploits Netlify's instant cache invalidation so production updates are fast.
  • Scalable and secure.

Google Page Speed test

This blog engine is Programmable, Open Source and Forkable.

Because the engine is programmed in Observable:

  • Content is written within a web-hosted IDE. You can generate content programmatically.
  • Content pages and the deployment pipeline are executed in the browser, in cells viewers can look at.
  • All pages can be forked and reprogrammed, allowing blog developers to customize their blog engine without installing tooling.