HTML5 in the browser: Local data storage

HTML5 Web Storage, Web Database, FileReader, FileWriter, and AppCaching APIs will transform Web pages into local applications, but not yet

Of all the changes bundled in the HTML5 drafts, few are as radical or subversive as the options for storing data locally. From the very beginning, the Web browser was intended to be a client in the purest sense of the word. It would display information it downloaded from a distant server, and it would do everything the distant server would tell it to do.

Programmers discovered the limitations of this model fairly soon, and before long browsers started offering website developers the chance to leave a little piece of data behind. The creators tried giving this 4,096-byte text string a cute name, "cookie," but that didn't stop the controversy. Cookies became the focus as the greater public started to wonder just how the inscrutable gnomes at the central office were tracking their every move. People demanded and got the ability to delete cookies, which limited what developers could do with them.

There were deeper problems with the spec. The cookies weren't just stored in the computer -- they were sent back to the server with every request. Savvy Web developers know it's not worth using many of those 4,096 bytes because the cost of sending that data on each and every call will drive up bandwidth bills and slow responsiveness.

The HTML5 standards crew chose to fix all of these problems and lay the foundation for the final victory of browser-based software by giving the JavaScript programmer the ability to store practical amounts of data on the local computer. At the simplest, this might be a cache for all of the calls to the central computer, but it can be much more. The more sophisticated programmers might allow the users to store their Web pages locally, imitating the last major feature of desktop software by gaining access to the disk. There's no need to install software any longer.

HTML5 Web Storage: Session storage

The simplest level of Web Storage will store data for the current session -- in other words, as long as the browser tab or window remains open. This may not be a hard limit, however, because the spec leaves open the opportunity for the browser to keep this data around "during restarts."

There's not much to the mechanism. Each document gets a sessionStorage object with a few major functions: setItem, getItem, and clear. The items are simply pairs of keys and data, much like an associative array. The data is a clone of the current values.

That's about it. New documents get new objects. There's not much difference between storing information in this sessionStorage and declaring a global variable.
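As a minimal sketch of those three calls: the helper names (createMemoryStorage, saveDraft, loadDraft) and the key "draft" are made up for illustration, and the in-memory stand-in exists only so the snippet runs outside a browser. It also assumes values are kept as strings, which is how shipping browsers behave, so structured data goes through JSON.

```javascript
// Hypothetical in-memory stand-in for the Storage interface.
// In a browser, pass window.sessionStorage to the helpers instead.
function createMemoryStorage() {
  const items = new Map();
  return {
    setItem(key, value) { items.set(String(key), String(value)); },
    getItem(key) { return items.has(String(key)) ? items.get(String(key)) : null; },
    clear() { items.clear(); },
  };
}

// Store an object under one key; values are strings, so serialize first.
function saveDraft(storage, draft) {
  storage.setItem('draft', JSON.stringify(draft));
}

// Read it back, or return null when nothing has been stored.
function loadDraft(storage) {
  const raw = storage.getItem('draft');
  return raw === null ? null : JSON.parse(raw);
}

const store = createMemoryStorage();
saveDraft(store, { subject: 'Hello', body: 'Unsent text' });
const restored = loadDraft(store);

store.clear();                        // empties the whole store
const afterClear = loadDraft(store);  // nothing left to read
```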

HTML5 Web Storage: Local storage

The real advantages come with access to the localStorage object, which looks quite similar to the sessionStorage object but behaves very differently. Where the sessionStorage forgets, the localStorage remembers. Data is supposed to stick around even after the window closes and the computer shuts down.

The persistence goes deeper. Two windows visiting the same website should share the data. A change by the code running in one window should change the data accessed by the other. As the spec says, a change made in one window should fire a storage event in every other window that shares the data. (This isn't always the case. In some browsers sessionStorage isn't shared between tabs, while in others it is. The edge conditions are not set. The sharing of localStorage and sessionStorage is not perfectly implemented yet.)
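In shipping browsers the notification is an event named "storage" that carries the key, the old value, and the new value. A rough sketch of a listener might look like the following; describeStorageChange is a made-up helper, and the registration is guarded so that the snippet does nothing outside a browser.

```javascript
// Turn a storage event into a readable summary. `event` only needs the
// key, oldValue and newValue fields that a StorageEvent provides.
function describeStorageChange(event) {
  if (event.key === null) {
    return 'storage was cleared';           // clear() reports a null key
  }
  if (event.newValue === null) {
    return `"${event.key}" was removed`;    // removal reports a null newValue
  }
  return `"${event.key}" changed from ${event.oldValue} to ${event.newValue}`;
}

// Register only in a browser. The event is delivered to the other windows
// on the same origin, not to the window that made the change.
if (typeof window !== 'undefined' && typeof window.addEventListener === 'function') {
  window.addEventListener('storage', (event) => {
    console.log(describeStorageChange(event));
  });
}
```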

There's been some debate over how tightly to limit access to this object. Right now only scripts from the same scheme, domain, and port are allowed access. This is pretty strict and prevents any confusion that might come about when people load common scripts or switch between HTTP and HTTPS.
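The scheme, domain, and port rule can be checked with the standard URL class, whose origin property folds the three parts into one string. This is only an illustration of the rule, not how browsers enforce it; sharesStorage is a hypothetical name and the example.com addresses are placeholders.

```javascript
// Two URLs share a storage area only if scheme, host and port all match.
function sharesStorage(urlA, urlB) {
  return new URL(urlA).origin === new URL(urlB).origin;
}

// Port 80 is the default for http, so these two are the same origin.
const defaultPort = sharesStorage('http://example.com/inbox', 'http://example.com:80/settings');

// Switching between HTTP and HTTPS changes the scheme, so the data is separate.
const httpVsHttps = sharesStorage('http://example.com/', 'https://example.com/');

// A different port is a different origin as well.
const otherPort = sharesStorage('http://example.com/', 'http://example.com:8080/');
```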

While this sounds like a dream for many Web programmers, it could just as easily cause nightmares: two windows can access the same data at the same time and create a race condition that corrupts it. There's a great deal of debate over whether the storage object should defend against this by implementing a mutex (a mutual exclusion algorithm) that can limit data corruption.

One recent draft notes, "The use of the storage mutex to avoid race conditions is currently considered by certain implementors to be too high a performance burden, to the point where allowing data corruption is considered preferable. Alternatives that do not require a user-agent-wide per-origin script lock are eagerly sought after."

This probably won't affect many of the simplest uses of the object, but it could easily produce bizarre errors when people leave several windows open. I often leave Web mail windows alive on my desktop, then open another instance because I'm too lazy to dig through the pile of windows and tabs. Programmers must be aware that different instances of their code will run in the same browser, and this code will have access to the same data.
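The kind of corruption at stake is the classic lost update. The following is a deliberately simplified simulation in a single script: a Map stands in for the shared localStorage area, and the two "tabs" are just interleaved statements. Real windows may run in separate processes, which is what makes this ordering possible.

```javascript
// Shared store standing in for localStorage (hypothetical, in-memory).
const shared = new Map([['unread', '5']]);

function readUnread() { return Number(shared.get('unread')); }
function writeUnread(count) { shared.set('unread', String(count)); }

// Both tabs read the counter before either one writes it back.
const seenByTabA = readUnread();
const seenByTabB = readUnread();

// Each tab marks one message as read, based on its own stale copy.
writeUnread(seenByTabA - 1);
writeUnread(seenByTabB - 1);

// Two messages were read, but the counter only dropped by one.
const finalUnread = readUnread();
```

Without a lock around the read-modify-write sequence, the second write silently overwrites the first.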

Note this quote from the spec: "Different authors sharing one host name, for example users hosting content on, all share one local storage object. There is no feature to restrict the access by pathname."

How much room do you get? Can you count on having enough room? Is there a way to defend against DNS spoofing? For all of the questions that localStorage answers, it creates many more.
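Because the amount of room is not guaranteed, a cautious script can treat every write as something that may fail. Browsers signal a full store by throwing from setItem, so a sketch along these lines wraps the call. The names trySetItem and createCappedStorage are invented here, and the capped stand-in only imitates a quota so the failure path can be exercised.

```javascript
// Attempt a write and report whether it succeeded.
function trySetItem(storage, key, value) {
  try {
    storage.setItem(key, value);
    return true;
  } catch (err) {
    return false;   // most likely the quota was exceeded
  }
}

// Hypothetical stand-in with a tiny character budget.
// In a browser, pass window.localStorage to trySetItem instead.
function createCappedStorage(maxChars) {
  const items = new Map();
  return {
    setItem(key, value) {
      const text = String(value);
      let used = key.length + text.length;
      for (const [k, v] of items) {
        if (k !== key) used += k.length + v.length;
      }
      if (used > maxChars) throw new Error('QuotaExceededError (simulated)');
      items.set(key, text);
    },
    getItem(key) { return items.has(key) ? items.get(key) : null; },
  };
}

const tiny = createCappedStorage(16);
const smallFits = trySetItem(tiny, 'a', '12345');          // within the budget
const largeFits = trySetItem(tiny, 'b', 'x'.repeat(32));   // over the budget
```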

Web SQL Database and IndexedDB

The key-value pairs in the localStorage object are powerful enough for many basic projects, but they're not comparable to relational databases that store the information in indexed tables. For that, there are not one but two options.

The first, the Web SQL Database standard, was drafted and implemented before being abandoned for a more abstract version. People using WebKit browsers and Opera will find that a small database engine, SQLite, was grafted onto the JavaScript engine to let people create tables and store rows using all that knowledge about SQL.

The work, though, was for naught because the committee decided it wanted something else. While the features are still available in the supporting browsers, the Web database standard is now filled with language designed to scare people away. "Beware," it warns. "This specification is no longer in active maintenance and the Web Applications Working Group does not intend to maintain it further."

The new king is the more abstract idea of the Indexed Database, an SQL-free pile of keys and values just like the localStorage object. The difference is that an index can speed finding the necessary object. In practice, this seems to mean that the browsers will store each table in a B-tree to speed lookup and allow the programmer to page through the data in some repeatable order.

The indexed storage also includes the ability to execute the changes as transactions, eliminating the questions about race conditions that may bedevil the localStorage object. This may warm the hearts of database programmers, but it may be too early to know exactly what to expect. The version of the draft I read while writing this includes the line, "TODO: decide what happens when dynamic transactions need to lock a database object that is already exclusively locked by another transaction." There are many details to work out.
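For a sense of how the pieces fit together, here is a rough sketch that uses the method names browsers eventually shipped, which may differ from the draft discussed here. The database, store, and index names ("notes-demo", "notes", "byTag") and both function names are hypothetical. The IDBFactory is passed in as a parameter; in a browser that would be window.indexedDB.

```javascript
// Open (or create) a database with one object store and one index.
function openNotesDb(factory) {
  return new Promise((resolve, reject) => {
    const request = factory.open('notes-demo', 1);
    // Runs only when the database is new or its version number goes up.
    request.onupgradeneeded = () => {
      const db = request.result;
      const store = db.createObjectStore('notes', { keyPath: 'id' });
      store.createIndex('byTag', 'tag', { unique: false });
    };
    request.onsuccess = () => resolve(request.result);
    request.onerror = () => reject(request.error);
  });
}

// Write one record and look it up through the index, all inside a single
// read-write transaction. Resolves once the transaction has committed.
function saveAndFindByTag(db, note) {
  return new Promise((resolve, reject) => {
    const tx = db.transaction('notes', 'readwrite');
    const store = tx.objectStore('notes');
    store.put(note);
    const lookup = store.index('byTag').get(note.tag);
    tx.oncomplete = () => resolve(lookup.result);
    tx.onerror = () => reject(tx.error);
  });
}
```

In a page, the two calls would be chained, for example openNotesDb(window.indexedDB).then((db) => saveAndFindByTag(db, { id: 1, tag: 'todo', text: 'Buy milk' })).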

HTML5 File API: FileReader

The final apostasies in HTML5 are the FileReader and FileWriter objects, two devices that reach outside of the browser's sandbox and actually touch the file system. It's one thing to offer the JavaScript programmer the ability to store objects from trip to trip, and it's another to let them have access to functions like readAsBinaryString.

The File API is not widely implemented yet, but it promises to dissolve the wall separating the "personal" part of the PC from the "inter" part of the Internet. There's even an updated scheme to make all of the file://C: URIs behave more like distant websites. JavaScript will see fewer differences between loading a local file and downloading data from a distant website with an XMLHttpRequest call.
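In the form browsers went on to ship, a script can read only files the user has explicitly handed over, for instance through a file input. A minimal sketch under that assumption follows; readFileAsText and the element id "picker" are hypothetical, and the ReaderCtor parameter exists only so the function can be exercised where no FileReader is available.

```javascript
// Read a user-selected file as text and resolve with its contents.
function readFileAsText(file, ReaderCtor = globalThis.FileReader) {
  return new Promise((resolve, reject) => {
    const reader = new ReaderCtor();
    reader.onload = () => resolve(reader.result);
    reader.onerror = () => reject(reader.error);
    reader.readAsText(file);   // asynchronous; the result arrives via onload
  });
}

// Browser wiring, assuming a hypothetical <input type="file" id="picker">.
if (typeof document !== 'undefined') {
  const picker = document.getElementById('picker');
  if (picker) {
    picker.addEventListener('change', async () => {
      const file = picker.files[0];
      if (!file) return;       // the user cancelled the dialog
      const text = await readFileAsText(file);
      console.log(`${file.name}: ${text.length} characters`);
    });
  }
}
```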

The details of this API are still missing. The spec is filled with useful suggestions like, "System-sensitive files (e.g. files in /usr/bin, password files, other native operating system executables) typically should not be exposed." Well, duh -- but notice the use of the word "should." The spec suggests that the browser "may" raise a SECURITY_ERR. The details are still in flux, and I don't think anyone knows what may come of opening up this Pandora's box. Perhaps the Web applications will routinely need access to the /usr/bin directory and all of the SECURITY_ERR events will drive the user mad. We can't be certain.

HTML5 File API: FileWriter 

If the FileReader API sounds like a recipe for massive privacy invasions, imagine what potential evil lurks in an API with the name FileWriter. Presumably there will be much good as well, including the ability to simplify the installation of new software. We can only hope.

The design of the FileWriter API is similar to all the other File APIs. You create a block of bytes called a Blob, pass it to a FileWriter object, and invoke the append or write methods. The next thing you know, your disk is filled with viruses. There are also mechanisms so that the viruses can choose between installing themselves synchronously or asynchronously. The data can be found inside the file with offset methods like seek.

The jokes about the viruses should remain jokes if the security model works as planned. Although the draft spec doesn't say much about the security model, it looks as if the goal is to give the average user all the rope they need to tie themselves up and hang themselves inadvertently. The browser will pop up a search box and ask where the data should be stored. Sensitive areas of the OS will probably be off limits, but I still wonder about the damage that could be done with supposedly safe sections. Imagine, for instance, an application that can write a block of bits to the Desktop directory and give it the name "Click Me." What percentage of the world can resist a message like that?

We can hope that the browser builders will move slowly with this tool. Perhaps they'll leave it disabled until a user decides to opt in through the preferences interface. Ideally, they'll bury the access even deeper, as Mozilla does with the preferences hidden beneath Firefox's about:config menu.

HTML5 Offline Web Applications: AppCaching

Letting Web pages store data locally can certainly help reduce network traffic by caching the AJAX calls and other important information displayed to the user. It doesn't do anything, though, about the Web page caching itself. (This isn't exactly true because websites can download much of their own logic and evaluate it, but a bit of bootstrapping code is still necessary to start up the page.)

The AppCaching API spells out just how long the browser can keep pieces of the Web pages stored locally. This not only reduces the need for reloading the pages, but it also makes it possible for the Web pages to work without an Internet connection. In other words, it makes them more like installed software.

There's already a fair amount of support for caching in the header tags of an HTML page, but it doesn't extend to the JavaScript files or the CSS pages. The AppCaching API solves this by creating a manifest file that lists all of the important parts of the Web page so that the browser won't get confused. The manifest is named in the root html tag like this: <html manifest="page.manifest">. The browser then looks at the list of files -- delivered with its own MIME type text/cache-manifest -- and treats them as a unit.

The application cache treats a few items differently. Data calls to server-side CGI functions, for instance, can be labeled in a separate part of the list so that they're not cached. There's also a general fallback section to handle problems. I'm guessing that these might be most useful when loading parts of a page from problematic websites. It could, for instance, supply placeholder icons for images that can't be fetched from a photo sharing site.
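A manifest covering those three kinds of entries might look roughly like this. The file names and paths are placeholders; the section keywords (CACHE, NETWORK, FALLBACK) and the required first line come from the manifest format itself.

```text
CACHE MANIFEST
# v1 - change this comment to make browsers fetch the files again

CACHE:
index.html
app.js
style.css

NETWORK:
/cgi-bin/

FALLBACK:
/photos/ /images/missing.png
```

Entries under CACHE are stored as a unit, anything under the /cgi-bin/ prefix always goes to the network, and a request under /photos/ that fails is answered with the placeholder image instead.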

When the manifest changes, the browser reloads everything -- a process that might be transparent but doesn't have to be. There's an ApplicationCache object that fires events whenever significant actions happen. If the code is updated, these events could be used to tell the user.
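As a sketch of how a page could use those events to tell the user about an update: watchForUpdates and the notify callback are hypothetical, and the code applies only to browsers that expose window.applicationCache, hence the guard.

```javascript
// Call `notify` once a new version of the cached files has been downloaded.
// `cache` is the ApplicationCache object (window.applicationCache).
function watchForUpdates(cache, notify) {
  cache.addEventListener('updateready', () => {
    if (cache.status === cache.UPDATEREADY) {
      cache.swapCache();   // switch to the freshly downloaded copy
      notify('A new version is ready. Reload to use it.');
    }
  });
}

// Register only where the object exists.
if (typeof window !== 'undefined' && window.applicationCache) {
  watchForUpdates(window.applicationCache, (message) => console.log(message));
}
```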

Embedding custom nonvisible data

Another interesting option is burying the data inside of the DOM tree. The jQuery framework, for instance, comes with a data method that will attach arbitrary objects to pieces of the DOM tree, allowing you to attach data the user can't see to some part of the screen. This makes it simpler to code operations like drag and drop because there's no need to keep separate track of the data and the representation of it.

Some HTML5 developers want to bring this feature to HTML5 in a standard way. The proposal lets arbitrary name-value pairs be attached to parts of the DOM tree. It isn't well supported yet.
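Assuming the proposal in question is the data-* attribute mechanism, which browsers expose through an element's dataset property, reading such values could look like this. The attribute names, the readDragPayload helper, and the stand-in object are all invented for the example.

```javascript
// Collect the custom values attached to an element. `element` needs only a
// `dataset` property, which browsers fill in from data-* attributes such as
// <li data-item-id="42" data-kind="photo">.
function readDragPayload(element) {
  return {
    id: Number(element.dataset.itemId),   // data-item-id becomes itemId
    kind: element.dataset.kind,
  };
}

// Stand-in for a DOM node, so the sketch runs without a page.
const fakeListItem = { dataset: { itemId: '42', kind: 'photo' } };
const payload = readDragPayload(fakeListItem);
```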

HTML5 Microdata

The idea behind the HTML Microdata spec is to create a class of machine-readable metadata tags that websites might add to the visible information. Instead of just inserting the characters "January 1st, 2011" or "New Year's Day," the website builder can add the time tag like this: <time itemprop="birthday" datetime="2011-01-01">New Year's Day</time>.
