Detecting Browser/Tab crashes POC
This POC shows how browser crashes could potentially be detected.
How to run it
- Run node ./server.js
- Open http://localhost:1234
- You can open multiple tabs (each tab will get a unique name)
- Logs are sent to the terminal via server.js
- Try various actions that can simulate a crash
- Once a crash is detected it will be sent to the server and stored in local memory
- http://localhost:1234 will show crashes that were reported
Resources
- https://github.com/getsentry/sentry-javascript/issues/5280
- http://jasonjl.me/blog/2015/06/21/taking-action-on-browser-crashes/
- https://medium.com/@JackPu/how-to-check-browser-crash-via-javascript-fa7d5af5e80b
Tested approaches
- Detecting crashes before they occur
- Track and persist state of tabs (last alive ping + if it was closed properly). Send crash reports based on the state.
Detecting crashes before they occur
The idea was to check if the page becomes unresponsive or is very close to hitting memory limits and report it over HTTP to persisted storage.
The tab may be close to crashing when:
- Memory usage is high. This can be checked with window.performance.memory
- Pros:
- It provides total JS heap size, used heap size and the limit
- Cons:
- The browser may dynamically change the limits and allocate additional memory
- Available only in Chromium-based browsers (non-standard API)
- The browser slows down. This could be checked by a ping mechanism using web/service workers.
- Pros:
- Available in all browsers
- Cons:
- When only one tab is open and it crashes, the service worker may be killed immediately, leaving too little time to send a crash report. Based on some experiments, only Firefox keeps the worker alive a bit longer.
I wasn't able to get reliable, consistent results with this approach.
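The memory check described above could look roughly like the sketch below. The helper names and the 0.9 threshold are assumptions made for illustration and were not part of the experiments. The performance object is passed in so the logic can be exercised outside Chrome.

```javascript
// Hypothetical helper: fraction of the JS heap limit currently in use,
// or null when performance.memory is not available (non-Chromium browsers).
function heapUsageRatio(perf = globalThis.performance) {
  const memory = perf && perf.memory;
  if (!memory || !memory.jsHeapSizeLimit) return null;
  return memory.usedJSHeapSize / memory.jsHeapSizeLimit;
}

// Assumed threshold: 0.9 is an illustrative value, not a measured one.
function isNearMemoryLimit(perf = globalThis.performance, threshold = 0.9) {
  const ratio = heapUsageRatio(perf);
  return ratio !== null && ratio >= threshold;
}
```

Because the browser may raise the limit at runtime, a high ratio is only a hint that the tab is under memory pressure, not a reliable predictor of a crash.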
Track and persist state of tabs
The idea is to track active tabs together with their last-active pings, and to stop tracking a tab when it closes correctly. If a tab stops sending pings and was not closed correctly, we assume it is frozen or has crashed.
Detection would consist of the following components:
- Storage to keep information about active tabs
- Client code used to periodically report the state of a tab to the storage
- Detection logic that periodically checks the state of storage
POC
In the POC, the following approaches were considered:
- Storage
- Browser local storage
- Browser session storage
- Browser IndexedDB
- pros: can be shared between workers and allows transactional updates
- External storage (over HTTP):
- cons: will stop reporting when the user is offline, even though the tab may not have crashed
Choice: IndexedDB
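As a rough illustration of the IndexedDB choice, the sketch below opens a database and writes one tab record in a single transaction. The database name "tab-tracker", the store name "tabs" and the "id" key path are hypothetical and may differ from what the POC uses. The IndexedDB factory is passed in as a parameter because both tabs and workers expose their own indexedDB global.

```javascript
// Hypothetical names: database "tab-tracker", object store "tabs", key "id".
const DB_NAME = "tab-tracker";
const STORE_NAME = "tabs";

// Opens the database and creates the object store on first use.
function openTabStore(idbFactory) {
  return new Promise((resolve, reject) => {
    const request = idbFactory.open(DB_NAME, 1);
    request.onupgradeneeded = () => {
      request.result.createObjectStore(STORE_NAME, { keyPath: "id" });
    };
    request.onsuccess = () => resolve(request.result);
    request.onerror = () => reject(request.error);
  });
}

// Inserts or overwrites the record of one tab inside a single transaction.
function putTab(db, record) {
  return new Promise((resolve, reject) => {
    const tx = db.transaction(STORE_NAME, "readwrite");
    tx.objectStore(STORE_NAME).put(record);
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}
```

Inside a worker this could be used as openTabStore(self.indexedDB) followed by putTab(db, { id, workerLastActive: Date.now() }).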
- Client code to periodically report the state (ping + closing)
- setInterval inside the tab thread
- cons: when the tab is inactive, setInterval gets deprioritized and runs less frequently
- setInterval inside service/web worker thread
- pros: when a message is sent the inactive/invisible tab can respond immediately
Choice: setInterval inside a service/web worker
- Client code to save state to storage
- save inside the tab thread
- cons: debugging the page will pause the thread, and the detector could report it as crashed
- save inside service/web worker
- pros: the worker will keep working even if the tab is paused
Choice: Save inside a web worker
A caveat is that Firefox doesn't kill the web worker immediately when the tab crashes. This could lead to a scenario in which the detector thinks the tab is still alive. At the same time, we need to track when the tab was last active. To mitigate this, we keep both timestamps: the last time the tab was active (for reporting) and the last time the worker was active (to detect crashes).
- Detection logic
- In a service/shared web worker
- cons: can be a single instance running independently of the tabs
Choice: Use a shared web worker. In theory it should work with a service worker as well, though based on experiments a service worker may be killed when the tab crashes, while shared web workers seem to keep running.
sequenceDiagram
autonumber
ClientController->>ClientWorker: Start
loop Update Loop
ClientWorker-->+ClientWorker: setInterval(..., 1000)
ClientWorker-->>ClientController: ping
ClientController->>ClientWorker: on ping from worker: post update { id, url, memory, ... }
ClientWorker->>IndexedDb: put { id, url, memory, tabLastActive, ... }
end
loop Activity Loop
ClientWorker-->+ClientWorker: setInterval(..., 1000)
ClientWorker->>IndexedDb: put { workerLastActive, ... }
end
ClientController->>ClientWorker: Stop
ClientWorker->>IndexedDb: delete { id }
- Client code executes in the same thread as the main app. It's responsible for starting the update loop in the worker.
- WebWorker starts the loop with setInterval. This is done in the worker because setInterval is slowed down in inactive tabs.
- WebWorker pings the client for the data (a WebWorker has no access to the URL, memory usage, etc.).
- WebWorker saves the data with the tabLastActive timestamp to IndexedDB when it receives a message from the tab. Saving is done in the worker so that it runs in a separate thread in case the Client thread is paused by a debugger.
- WebWorker saves the workerLastActive timestamp every second.
- When the Client is unloaded properly, it sends a message to the WebWorker to remove the entry from IndexedDB.
A separate process checks for stale tabs and reports them to the backend. It connects to the same IndexedDB.
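The worker side of the steps above can be sketched as a single message handler. This is a simplified illustration, not the POC's actual code: the message types ("start", "update", "stop"), the store object with put/delete, and the injected timer functions are all assumptions, chosen so the logic can run without a browser.

```javascript
// Sketch of the worker logic. `store`, `postToTab` and the timer functions
// are hypothetical, injected dependencies (not the POC's real API).
function createClientWorker({
  store,
  postToTab,
  intervalMs = 1000,
  setIntervalFn = setInterval,
  clearIntervalFn = clearInterval,
  now = Date.now,
}) {
  let record = null; // last known state of the tab
  let timers = [];

  return function onMessage(message) {
    if (message.type === "start") {
      record = { id: message.id, workerLastActive: now() };
      timers = [
        // Update loop: ask the tab for fresh data.
        setIntervalFn(() => postToTab({ type: "ping" }), intervalMs),
        // Activity loop: record that the worker itself is still alive.
        setIntervalFn(() => {
          if (!record) return;
          record = { ...record, workerLastActive: now() };
          store.put(record);
        }, intervalMs),
      ];
    } else if (message.type === "update" && record) {
      // The tab answered the ping, so it is responsive right now.
      record = { ...record, ...message.data, tabLastActive: now() };
      store.put(record);
    } else if (message.type === "stop" && record) {
      // Clean shutdown: stop both loops and forget the tab.
      timers.forEach((timer) => clearIntervalFn(timer));
      timers = [];
      store.delete(record.id);
      record = null;
    }
  };
}
```

In a real worker the returned handler would be wired to the worker's message event, and postToTab would wrap postMessage.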
sequenceDiagram
autonumber
Detector->>+Detector: setInterval(..., 1000)
Detector->>IndexedDb: get all tabs
Detector->>Detector: check if workerLastActive is older than 3 seconds
Detector->>-Backend: /crash-report { id, url, memory, tabLastActive ... }
The workerLastActive timestamp is used to detect the actual crash of a tab, and tabLastActive is used for reporting. They may be out of step in Firefox, which keeps the worker active after the tab crashes, or when the tab's thread is paused due to debugging (the web worker will keep running).
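The detector's staleness check can be sketched as two small functions. The function names are hypothetical; the 3 second threshold and the report fields are taken from the diagram above.

```javascript
// Threshold from the diagram: a worker silent for more than 3 seconds.
const STALE_AFTER_MS = 3000;

// Returns the records whose worker has not written a heartbeat recently.
function findCrashedTabs(records, now = Date.now(), staleAfterMs = STALE_AFTER_MS) {
  return records.filter((r) => now - r.workerLastActive > staleAfterMs);
}

// Builds the /crash-report payload. workerLastActive is used only for
// detection, so the report carries tabLastActive instead.
function toCrashReport({ id, url, memory, tabLastActive }) {
  return { id, url, memory, tabLastActive };
}
```

For example, findCrashedTabs(allTabs).map(toCrashReport) would yield the payloads to send to the backend, where allTabs stands for the records read from IndexedDB.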