Version 60, changed by brian 03/21/2007.
This page is for collaborating on design work and notes about the dojo.data package.
Please feel free to jump in. You can contribute by adding notes and ideas and comments to this page. Code snippets would be especially useful additions -- if we can figure out what we want the APIs to look like from the caller's perspective, that will help the implementations to fall into place.
We currently have two experimental branches of dojo.data code:
We're working on coming up with a standard set of terminology, so that we're all speaking the same lingo.
Here's a list of some of the words that have come up in different places in the code and the working notes. We probably want to prune some of the synonyms from this list:
- Workspace = Model = Package
- DataProvider
- Record = DataObject = Item
- DataClass = Class = Kind
- Bind = Binder = Binding
- Attribute = Member = Property
- Field
- Reference = Member
- Feature = StructuralFeature = Member
- Adaptor
- Child / Children
- Mediator
- Structured
- Relational
- gid = globalId
- Signature
- Root
- VBL expression
- path = XPath expression
- xmiId
- Value
- Container
- ClassType
- Repository = Database = Data store
- Result Set
- DataSet
- Graph
- ...to-do: add more terms...
- ...to-do: add descriptions...
- ...to-do: group the terms...
Also, the chart on this other page has a terminology comparison section: http://openrecord.org/dojo/2006-01-09/data_model_comparison.html
Overloaded words
Here's a list of some of the JavaScript reserved words that we may want to avoid:
- class
- interface
- namespace
- package
- static
- transient
And here are some more terms that already have other meanings in the context of AJAX programming, which we might want to be careful about using, to avoid confusion for people who are reading through our code/APIs/documentation:
- id -- HTML, CSS
- class -- HTML, CSS
- attribute -- XML, HTML
- element -- XML, HTML
- object
- field
Proposed terminology
data provider -- A data provider is a JavaScript object that knows how to get data from a data source and copy the data into standard data structures in the browser's memory.
data source -- Examples of data sources include CSV files, XML files, relational databases, XML databases, RDF datastores, HTML data islands, and web services like del.icio.us.
schema -- Some data sources have associated schema information. A schema can include information about the different kinds of records in the data store, the attributes associated with each kind of record, the type of each attribute, etc. For example, a schema might define things like book, author, title, and ISBN number.
content -- Content is the non-schema data from a data source; the actual data records from a data source.
record -- A record is an in-memory data structure, in a workspace, that corresponds to one row in a relational database table, or one line from a CSV file, or one resource in an RDF graph, or one object in an object-database.
(To-do: is the word record a bad choice, because it could be confused with the record itself in a relational database? The experimental dojo.data code I (Brian) wrote uses the term item -- I like that term, but maybe that's just because I'm already familiar with it, from the OSAF Chandler project and from OpenRecord. SDO and the IBM code use the term Data Object, but I'm worried that in time that might grow cumbersome, with lots of lines of code like var dataObject = new dojo.data.DataObject(); -- I'm worried that people will end up using shortened alternatives like do and object, which could lead to really confusing code.)
attribute -- An attribute corresponds to a column in a relational database table or a CSV file, or a property in an RDF statement.
(To-do: the word attribute could cause confusion, since it already has a specific meaning in the context of HTML and XML. Should we pick a different term, like member?)
workspace -- A workspace is an in-memory copy of information from an external data source. A workspace can include both schema information and content records. A workspace might include a complete copy of an entire data store, if the data store is small (for example, a little CSV file). Or, a workspace might cache only a small working set (a subset of the data store), and dynamically load more content from the data source as necessary.
representation -- A representation refers to the data structures that a workspace uses to store records in the browser. One representation might store content in XML DOM nodes, another might use simple anonymous JavaScript objects, and yet another might use dojo.declare to create different constructor functions for each kind of record being stored. (Any workspace representation should be independent from the data source representation, and the data provider should handle any type conversion or format conversion necessary.)
(To-do: would data model be a better term for this?)
class -- The word class refers to the HTML class attribute itself, or to one of the values assigned to the class attribute, as well as the use of that class in CSS stylesheets. To avoid confusion, the word class has no other meaning in dojo.data context.
kind -- A kind of record in a dojo.data schema is like a class (of objects) in Java, or a table (of records) in an RDBMS.
(To-do: kind seems a little goofy. Can we come up with anything better, while avoiding the word class?)
type -- A type is the data-type of a value in a record in a data store.
value -- A value can mean either a simple literal value or a reference to another record. A "simple" literal value may not be simple, given the wide variety of literal types available from different data sources. (For example, XML has 16 types of numbers and 9 types of dates.)
package -- A package in the dojo source tree is a directory of JavaScript files. For example, you can include the dojo event package by using the line dojo.require("dojo.event.*"). To avoid confusion, the word package should have no other meaning in dojo.data context.
binding -- A binding is a connection between a part of the UI and a part of the content in a workspace. A simple binding could bind a single UI input field to a single attribute of one record. A bigger binding could bind an entire Grid widget to a set of records. Bindings keep the UI in sync with the workspace, and the workspace in sync with the UI.
data transport -- A data provider might use a data transport to get records from a remote data store, after which the data provider might do the format conversion necessary to load the records into the representation being used for some workspace.
data set -- A collection of records; a subset of all the records in a workspace.
query -- A query is some text string that can be used to fetch a set of records from a data source. Different data sources may offer different query languages, from XPath to SQL to SPARQL. The dojo.data layer does not know about any specific query languages, and merely accepts queries from client code and passes them through to a data source.
...to-do: talk about Controllers, Bindings, Model, Schema, Data Providers, and how they all fit together -- maybe make a drawing...
How many data model representations should dojo.data have?
Option 1: functionality added in layers
Brian Skinner proposes having a single JavaScript data representation (data model implementation) for holding both structured data (such as RDBMS records) and semi-structured data (such as RDF resources). The theory is that structured data is basically just a constrained version of semi-structured data, with constraints like strong typing and cardinality restrictions. Our core data model representation could be unconstrained, and then constraints could be added in layers, by loading additional dojo.data packages. I don't know how realistic this approach is, but if we can do it this way, it seems like there would be a huge win in only having a single version of the core data structures.
This same layering approach could be used for other aspects of the single core representation. For example, the most basic package could be extremely lightweight, designed only for read-only use and offering only a get() accessor. Applications that need to do read/write access could load an additional package that would extend the core representation to provide a set() accessor. Similarly, the most basic package could represent all literals as simple JavaScript literals (like strings and numbers), and additional optional packages could add support for more specific data types (like int, float, varchar(255), etc.).
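The layering idea can be sketched in plain JavaScript. This is only an illustration under stated assumptions: makeReadOnlyStore, addWriteSupport, and addTypeConstraints are hypothetical names invented here, not existing dojo.data functions, and loading an "additional package" is modeled simply as calling a function that extends the core object.

```javascript
// Hypothetical core layer: read-only, offering only a get() accessor.
function makeReadOnlyStore(records) {
  return {
    get: function (record, attribute) {
      return record[attribute];
    },
    getRecords: function () {
      return records.slice(); // copy the list so callers cannot reorder the store
    }
  };
}

// Hypothetical optional layer: extends an existing store with a set() accessor.
function addWriteSupport(store) {
  store.set = function (record, attribute, value) {
    record[attribute] = value;
  };
  return store;
}

// Hypothetical optional layer: wraps set() with a simple type constraint.
function addTypeConstraints(store, types) {
  var uncheckedSet = store.set;
  store.set = function (record, attribute, value) {
    var expected = types[attribute];
    if (expected && typeof value !== expected) {
      throw new TypeError(attribute + " must be of type " + expected);
    }
    uncheckedSet(record, attribute, value);
  };
  return store;
}

var layeredStore = makeReadOnlyStore([{ name: "Kansas", abbr: "KS" }]);
var kansas = layeredStore.getRecords()[0];

addWriteSupport(layeredStore);                         // "load" the read/write layer
addTypeConstraints(layeredStore, { abbr: "string" });  // "load" the constraint layer
layeredStore.set(kansas, "abbr", "Kan.");
```

An application that only reads data would stop after makeReadOnlyStore and never pay for the other two layers.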
Option 2: different packages optimized based on different scenarios
Another alternative would be to have completely different JavaScript data representations (data model implementations) for holding structured data (such as RDBMS records) versus semi-structured data (such as RDF resources). Each data model implementation would conform to a standard data model interface, so widgets could be bound to the data without the widget code or the binding code needing to know which data model implementation was actually being used. This approach would allow each data model implementation to be optimized specifically for the set of constraints that it enforces, without needing to be overly general.
This is the approach that has been taken in the IBM datamodel contribution. The main reason is that for XML data sources, a data model implementation that uses the native browser support for XML can be used, rather than creating a complex mapping layer that converts from XML to internal JS data structures (although XML serialization/deserialization could still be provided if data needs to be interchanged between XML and JS data sources).
Another example where it is advantageous to allow multiple datamodel implementations is for the use case where JSON data is simply being displayed (read only) and manipulated programmatically via property access. In this scenario, JS objects/arrays can be used directly as the data model implementation, with the restriction that more advanced features are not possible because property set/get can not be intercepted.
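As a rough sketch of what "different implementations behind one interface" could look like, the two hypothetical models below store records differently but expose the same getRecords() and get() calls. The names makeObjectModel, makeRowModel, and collectValues are assumptions made for this example; a row-of-cells model stands in for a second representation because an XML DOM is not available in every JavaScript environment.

```javascript
// Implementation 1: JSON-style objects are used directly as the records.
function makeObjectModel(objects) {
  return {
    getRecords: function () { return objects; },
    get: function (record, attribute) { return record[attribute]; }
  };
}

// Implementation 2: each record is an array of cells (as parsed from a CSV line);
// the attribute names live in a separate header array.
function makeRowModel(header, rows) {
  return {
    getRecords: function () { return rows; },
    get: function (record, attribute) {
      var column = header.indexOf(attribute);
      return column === -1 ? undefined : record[column];
    }
  };
}

// Caller code (standing in for a widget or binding) that works with either
// implementation, because it only uses the shared interface.
function collectValues(model, attribute) {
  return model.getRecords().map(function (record) {
    return model.get(record, attribute);
  });
}

var objectModel = makeObjectModel([{ abbr: "KS" }, { abbr: "IA" }]);
var rowModel = makeRowModel(["name", "abbr"], [["Kansas", "KS"], ["Iowa", "IA"]]);

var fromObjects = collectValues(objectModel, "abbr");
var fromRows = collectValues(rowModel, "abbr");
```

Both calls to collectValues produce the same list of abbreviations, even though neither record representation was converted into the other.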
If this approach is taken, it is important that the programming model (interface) for data access across the implementations be consistent, requiring a hybrid approach to the
How do you locate a datasource? How do you obtain the workspace or a data management singleton?
TBD
Data model API -- object-oriented vs. centralized
Option 1: object-as-parameter API
With an object-as-parameter API, to set and get the attributes of a record, you call a method on some central data manager like a workspace, and you pass as a parameter the record that you want to operate on. For example:
var workspace = new dojo.data.Workspace();
var record = workspace.newRecord();
workspace.set(record, 'name', 'Kansas'); // sets the value of an attribute
workspace.set(record, 'abbr', 'KS');
var abbr = workspace.get(record, 'abbr'); // gets an attribute value
One advantage of the object-as-parameter API is that the API almost completely hides the record object itself, so the data model implementation is free to store the data in pretty much any data structure it wants to. The data could be stored in an XML data island within a web page, or the data could be stored in simple anonymous JSON objects.
Another advantage of this approach over Option 3 (identifier/handle-based access) is that it is more efficient (and, one could argue, a requirement) for applications to be able to directly manipulate the data object using the native object interfaces specific to the data object's implementation. For example, if the data is held in XML objects implemented by the host environment, the data can more easily be manipulated via web-standard APIs for DOM or via transformation.
The disadvantage of this approach is the portability of application logic between various data source implementations [replace "data source" with "data provider" to match terminology section above?]. However, it is very common for application logic to have knowledge about the kind of data it is working with. If we take a handle/identifier approach, we lose the flexibility of jumping into the data's implementation and will need to provide some other way to determine the implementation kind, given an identifier, and a way to convert an identifier into the object that it represents.
Option 2: object-oriented API
With an object-oriented API, to set and get the attributes of a record object, you would call a method on the record object itself. For example:
var workspace = new dojo.data.Workspace();
var record = workspace.newRecord();
record.set('name', 'Kansas'); // sets the value of an attribute of the data object
record.set('abbr', 'KS');
var abbr = record.get('abbr'); // gets an attribute value
An advantage of the object-oriented API is that it's more concise than the object-as-parameter API (shorter and easier to type), and maybe also easier to read and understand.
A disadvantage to this approach is that it cannot be used with data model implementations where you do not have control over the data model's implementation (such as XML or JS data structures).
Option 2.1: object-oriented API (strongly typed)
For completeness, we should mention this variation on Option 2, the object-oriented style record object:
A variation on the object-oriented API is that the data object implementation can create strongly typed property accessors and mutators for each property described in the class of a data object, as follows:
var workspace = new dojo.data.Workspace();
var record = workspace.newRecord();
record.setName('Kansas'); // sets the value of an attribute of the data object
record.setAbbr('KS');
var abbr = record.getAbbr(); // gets an attribute value
The only advantage to this approach is that it's slightly more object-oriented for the programmer.
Disadvantages:
- Application logic becomes dependent on the classes/interfaces in the model. Option 3 only introduces application dependencies on the generic data access api.
- Additional functions must be created for every attribute/relationship in the model.
- This style cannot be used when you do not have control over the data model implementation (JS/XML)
For these reasons, both Chris and Brian would not recommend supporting this style of access as a feature.
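To make the second disadvantage concrete, here is a minimal sketch of how such accessors might be generated. The helper addTypedAccessors and the _values field are hypothetical, invented only for this illustration; the point is that two new functions are created for every attribute.

```javascript
// Hypothetical helper: adds getXxx()/setXxx() methods to a record object
// for each attribute name in the given list.
function addTypedAccessors(record, attributeNames) {
  attributeNames.forEach(function (name) {
    var suffix = name.charAt(0).toUpperCase() + name.slice(1);
    record["get" + suffix] = function () {
      return record._values[name];
    };
    record["set" + suffix] = function (value) {
      record._values[name] = value;
    };
  });
  return record;
}

var typedRecord = addTypedAccessors({ _values: {} }, ["name", "abbr"]);
typedRecord.setName("Kansas"); // generated mutator
typedRecord.setAbbr("KS");
var typedAbbr = typedRecord.getAbbr(); // generated accessor
```

With two attributes this creates four functions per record; attaching them to a shared prototype per kind of record would reduce the cost, but the application code would still depend on the attribute names in the model.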
Option 3: identifier-as-parameter API
The identifier-as-parameter option is similar to the object-as-parameter API, in that all access is controlled through a "universal" data manager such as a workspace. The difference is that in this API you pass some kind of identifier to specify a record, rather than passing a record object itself. For example:
var workspace = new dojo.data.Workspace();
var identifier = workspace.newRecord();
workspace.set(identifier, 'name', 'Kansas'); // sets the value of an attribute
workspace.set(identifier, 'abbr', 'KS');
var abbr = workspace.get(identifier, 'abbr'); // gets an attribute value
An advantage of the identifier-as-parameter API is that it completely hides the record representation itself, leaving the data model implementation free to store data however it wants to.
Different data-sources might have different types of identifiers, so an identifier might be a UUID, a URL (for RDF), a compound key (for a RDBMS), or an XPath expression.
Using identifiers with key information can complicate the api, since even for a single data object implementation there can be multiple ways of querying and locating records. Rather than combining property access and mutation together with the capability of locating a record which the workspace implementation would need to figure out somehow, it would be simpler to have one api for property access/mutation on the workspace that works on a record that has already been found, and another api (or set of api's) for how you find and identify records.
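The separation suggested above (one API for property access on an already-located record, another for locating records) might look roughly like this. IdWorkspace, its find() method, and the "record-N" identifier format are all assumptions made for the sketch, not part of any existing dojo.data code.

```javascript
// Hypothetical workspace in which callers only ever hold string identifiers.
function IdWorkspace() {
  this._records = {}; // identifier -> attribute/value map, hidden from callers
  this._nextId = 1;
}

IdWorkspace.prototype.newRecord = function () {
  var identifier = "record-" + this._nextId++;
  this._records[identifier] = {};
  return identifier;
};

// Property access/mutation API: works on a record that has already been found.
IdWorkspace.prototype.set = function (identifier, attribute, value) {
  this._lookup(identifier)[attribute] = value;
};

IdWorkspace.prototype.get = function (identifier, attribute) {
  return this._lookup(identifier)[attribute];
};

// Separate API for locating records: returns the matching identifiers.
IdWorkspace.prototype.find = function (attribute, value) {
  var records = this._records;
  return Object.keys(records).filter(function (identifier) {
    return records[identifier][attribute] === value;
  });
};

IdWorkspace.prototype._lookup = function (identifier) {
  if (!Object.prototype.hasOwnProperty.call(this._records, identifier)) {
    throw new Error("Unknown identifier: " + identifier);
  }
  return this._records[identifier];
};

var idWorkspace = new IdWorkspace();
var kansasId = idWorkspace.newRecord();
idWorkspace.set(kansasId, "name", "Kansas");
idWorkspace.set(kansasId, "abbr", "KS");
var foundIds = idWorkspace.find("abbr", "KS");
```

Because get() and set() never interpret the identifier, a different workspace could hand out URIs or UUIDs instead without changing the calling code.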
Option 4: direct access data structures
A couple people have suggested an alternative in which there really isn't an API at all. Instead of calling getter methods, client code reads values directly from the data structures themselves. You could follow this approach with a data structure built from either XML DOM nodes or JavaScript anonymous objects. This approach has the advantage of being incredibly simple and lightweight, at least for some simple tasks. It would work especially well in use cases that fit this pattern:
- read-only access
- the entire data set can be loaded in memory at once
- simple JavaScript data types are fine
However, if we start with this approach, then there's no way to incrementally add features like validation, on-demand loading, multiple UI bindings that are automatically kept in sync, etc.
This approach can be supported behind a data access API, but not an object-oriented API.
Option 5: a hybrid
Another alternative is to simultaneously offer two or more APIs. For example, offer both object-oriented and identifier-as-parameter. Either one could probably be added as a fairly simple shim around the other. That way the object-oriented API would be available for people who found that easier, while the identifier-as-parameter API could be used when performance was an issue. That could be "best of both worlds", or it could be "design-by-committee".
This is the only viable approach that doesn't box us into a corner. We need a way to uniquely identify an instance of data both within a client and between the client and a server (over multiple requests). In other cases we'll need the flexibility of dropping down to the native data model implementation for efficiency in manipulating the data, and will want access to the records via an object-oriented style in the case of JS. In simple read-only scenarios where the data is in JSON format, an extra conversion to get the JS data into a canonical data format is unnecessary, which simplifies both implementation and programming. Similarly, if the data comes in from a source in XML format, which browsers can handle natively, we should be able to keep the data in the source format, without keeping a canonical JS copy of the data in memory and performing costly conversions with no added benefit; that also allows the use of implemented web standards such as XPath and XSLT on the data. At the same time, it would be highly preferable to have some consistency in the data access/mutation APIs, for both single-record (attribute) access and path-based access to graphs of data.
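A minimal sketch of the "shim" idea follows: an object-oriented record wrapper layered over an identifier-as-parameter store, so both styles operate on the same data. The names makeIdStore and wrapRecord are hypothetical and exist only for this example.

```javascript
// Minimal identifier-as-parameter store (hypothetical), kept tiny on purpose.
function makeIdStore() {
  var values = {}; // identifier -> attribute/value map
  var count = 0;
  return {
    newRecord: function () {
      var identifier = "r" + (++count);
      values[identifier] = {};
      return identifier;
    },
    get: function (identifier, attribute) {
      return values[identifier][attribute];
    },
    set: function (identifier, attribute, value) {
      values[identifier][attribute] = value;
    }
  };
}

// Object-oriented shim: a record object whose methods forward to the store.
function wrapRecord(store, identifier) {
  return {
    getIdentifier: function () { return identifier; },
    get: function (attribute) { return store.get(identifier, attribute); },
    set: function (attribute, value) { store.set(identifier, attribute, value); }
  };
}

var hybridStore = makeIdStore();
var stateId = hybridStore.newRecord();
var stateRecord = wrapRecord(hybridStore, stateId);

stateRecord.set("name", "Kansas");      // object-oriented style
hybridStore.set(stateId, "abbr", "KS"); // identifier-as-parameter style
```

A value written through either style is visible through the other, since the wrapper holds no data of its own.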
Aspects of the API
The options above include examples of getter and setter calls. In reality we might have not just getter and setter calls, but also a variety of other methods, for dealing with widget bindings, incremental data loading, introspection, version histories, attribution, type conversion, sorting, etc.
The examples above assume that the data record has been located first by some other means/api, prior to individual property access.
The examples above only cover properties with simple data types. Additional scenarios need to describe how more complex relationships such as parent/child containment and references to other data objects should be handled for data access.
XPath expressions
XML is one of the common components of AJAX programming, and XML data sets may be one of the most common types of data set that people want to use the dojo.data package to work with. In an XML context it's natural to want to use XPath expressions to specify nodes (data objects) or attribute values in an XML data set.
If the dojo.data package uses XPath expressions in the API for XML data providers, how should the dojo.data package handle XPath in the API for other sorts of data sources (JSON, CSV, relational, etc.)?
Option 1: Have all dojo.data APIs use XPath (or a subset of XPath)
We could strive to ensure that the data model API is completely uniform for all data sets, regardless of the data source that the data came from. We could define a reasonable subset of XPath, and restrict the API to using that subset. For a non-XML data model, we could translate the XPath expressions into a semantically equivalent query.
XPath subsets
Some of the features available in XPath may not map well to data sets that come from sources like CSV files and relational databases. For example, we might have problems with XPath predicates that assume that result sets are ordered, like book[3]/authors[2] or book[3]/authors[last()].
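To get a feel for what a restricted subset might involve, here is a small sketch that evaluates child steps with optional positional predicates against plain JavaScript data. The function evaluatePath and the supported grammar are assumptions for illustration only (this is not an XPath implementation), and the sample library data is made up. Paths are written with XPath's "/" step separator.

```javascript
// Evaluates a tiny XPath-like subset against plain JavaScript objects.
// Supported: child steps separated by "/", each with an optional 1-based
// positional predicate such as [3] or [last()]. Everything else is rejected.
function evaluatePath(root, path) {
  var current = root;
  var steps = path.split("/");
  for (var i = 0; i < steps.length; i++) {
    var match = /^([A-Za-z_][\w-]*)(?:\[(\d+|last\(\))\])?$/.exec(steps[i]);
    if (!match) {
      throw new Error("Unsupported step: " + steps[i]);
    }
    if (current === undefined || current === null) {
      return undefined;
    }
    var value = current[match[1]];
    if (match[2] !== undefined) {
      // Positional predicates only make sense for ordered collections.
      if (!Array.isArray(value)) {
        throw new Error("Positional predicate on unordered value: " + match[1]);
      }
      var index = match[2] === "last()" ? value.length : Number(match[2]);
      value = value[index - 1];
    }
    current = value;
  }
  return current;
}

// Made-up sample data for the example.
var library = {
  book: [
    { title: "A", authors: ["Ann", "Bob"] },
    { title: "B", authors: ["Cy"] },
    { title: "C", authors: ["Dee", "Eve", "Flo"] }
  ]
};

var secondAuthor = evaluatePath(library, "book[3]/authors[2]");
var lastAuthor = evaluatePath(library, "book[3]/authors[last()]");
```

The positional predicates work here only because JavaScript arrays are ordered; a data source that returns unordered result sets would have to reject them or define an ordering, which is exactly the mapping problem described above.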
Option 2: Use XPath only for XML data providers
Option 1 might be too complicated. A simpler alternative would be to only use XPath for XML data providers.
Option 3: Never use XPath
Perhaps option 1 is too complicated, and option 2 fails to ensure that the data model API protects the widget and data binding layers from needing to know about the details of different data sources. Another option is to consistently never use XPath.
Client-side Derivation Rules
If we want to have derived attributes that are calculated on the client, how should derivation rules be expressed?
Option 1: XPath expressions
Example:
pricePerPage.setDerivationRule("number(../book/price/text()) div number(../book/pages/text())");
Advantages:
- well-specified standard
- possible to identify dependent variables by parsing xpath
Disadvantages:
- has to be implemented from scratch?
Option 2: JavaScript expressions
Example:
pricePerPage.setDerivationRule("price/pages");
Advantages:
- well-specified standard
Disadvantages:
- lends itself to being implemented using eval, which creates a security hole
- in practice, has to be implemented from scratch?
- dependent variables (constraints) need to be specified when the derived attribute is declared if fine-grained updates need to be supported
Option 3: ???
Other de-facto standards for simple formulas:
- Excel
- Visual Basic
- etc.
Dependent Variables in Derived Attributes
An issue that can come up with derived attributes is with being smart about updating content which is rendering derived values when dependent variables (variables that the expression is dependent on) change. For many use cases, this is not a problem in practice, since a complex widget can simply refresh all of its bound data values to get the latest values. However, in some cases like a spreadsheet this can be problematic if the expression takes too much time and a large number of calculated values is being displayed by the widget.
This problem exists for both XPath and Script approaches to derived attributes (but is simpler to deal with in the xpath style expressions which are more easily parsed for dependent variables).
Unless something like a constraint system is implemented and dependent variables are declared for derived attributes, workarounds typically result in application logic needing to be written to be smart about handling updates.
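One way to picture declared dependencies is the sketch below, in which each derived attribute lists the attributes it depends on and is recomputed only after one of them changes. The names makeDerivingRecord and defineDerived are hypothetical. The rule is passed as a function rather than an expression string, which is an assumption of this sketch that sidesteps eval; it is not something the options above prescribe.

```javascript
// Hypothetical record with derived attributes whose dependencies are declared
// up front, so only the affected derived values are recomputed.
function makeDerivingRecord(values) {
  var rules = Object.create(null); // derived name -> rule description
  return {
    defineDerived: function (name, dependsOn, compute) {
      rules[name] = { dependsOn: dependsOn, compute: compute, stale: true, cached: undefined };
    },
    set: function (attribute, value) {
      values[attribute] = value;
      // Mark as stale only the rules that declared this attribute as a dependency.
      Object.keys(rules).forEach(function (name) {
        if (rules[name].dependsOn.indexOf(attribute) !== -1) {
          rules[name].stale = true;
        }
      });
    },
    get: function (attribute) {
      var rule = rules[attribute];
      if (!rule) {
        return values[attribute];
      }
      if (rule.stale) {
        rule.cached = rule.compute(values);
        rule.stale = false;
      }
      return rule.cached;
    }
  };
}

var recomputations = 0; // counts how often the rule actually runs
var bookRecord = makeDerivingRecord({ title: "Example", price: 50, pages: 200 });
bookRecord.defineDerived("pricePerPage", ["price", "pages"], function (v) {
  recomputations++;
  return v.price / v.pages;
});
```

Reading pricePerPage repeatedly, or changing title, does not rerun the rule; changing price or pages marks it stale so that the next read recomputes it.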
Identifiers
Different data sources use different forms of identifiers:
- compound keys (as in an RDBMS)
- URIs (as in RDF)
- UUIDs
- none (as in a CSV file)
- XPath expression (to identify item in an XML document)
Most identifiers can be treated as generic text strings (URIs, UUIDs, XPath expressions). But in general an RDBMS key won't map especially well to a simple text string, and the RDBMS use-case is probably one of the most common use-cases. So the question here is how should our JavaScript data model representation handle identifiers?
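The difficulty with compound keys can be shown in a few lines. Simply joining the key parts with a separator is ambiguous, while serializing the parts as a JSON array is one possible unambiguous encoding. This is offered only as an illustration of the problem, not as a proposal; joinKey, encodeKey, and decodeKey are hypothetical names, and the sample key values are made up.

```javascript
// Naive encoding: joining key parts with a separator can make two
// different compound keys look identical.
function joinKey(parts) {
  return parts.join("-");
}

// One possible unambiguous encoding: serialize the parts as a JSON array.
function encodeKey(parts) {
  return JSON.stringify(parts);
}

function decodeKey(text) {
  return JSON.parse(text);
}

// Two different two-part keys (made-up values).
var keyA = ["2006-01", "09"];
var keyB = ["2006", "01-09"];

var naiveCollision = joinKey(keyA) === joinKey(keyB);       // both "2006-01-09"
var encodedCollision = encodeKey(keyA) === encodeKey(keyB); // distinct strings
```

Here naiveCollision is true and encodedCollision is false, and decodeKey(encodeKey(keyA)) recovers the original parts.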
Option 1:
...to-do: propose a solution...
Here are some ideas for unit tests...
Idea 1: equivalent representations
Come up with very simple data set, like this one:
Title | Author | Publication Date | Price
Harry Potter and the Goblet of Fire | J.K. Rowling | July 8, 2000 | 18.89
The Future of Ideas | Lawrence Lessig | October 22, 2002 | 9.75
The End of Poverty | Jeffrey Sachs | February 28, 2006 | 10.40
Then make a variety of files with representations of that one data set in different storage formats:
- XML
- JSON
- CSV
- RDF
- HTML table
- SQL table
- in-memory (built programmatically)
Then have the unit test read the data set out of each of the storage formats and into each of the in-memory formats that we support:
- XML
- JavaScript
If we started with 6 storage formats and 2 in-memory formats, now we'll have 12 simultaneous data models loaded in memory. At this point the unit test can run a simple "diff" that compares the data models. The unit test fails if any of the data models have differences.
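A scaled-down sketch of such a test, using only two storage formats (CSV and JSON) and one in-memory format (plain JavaScript objects), could look like this. The loader and diff functions are hypothetical helpers written for the example; the CSV parser handles double-quoted cells but not escaped quotes, and all values are kept as strings.

```javascript
// Splits one CSV line into cells; supports double-quoted cells containing
// commas, but not escaped quotes (enough for this data set).
function parseCsvLine(line) {
  var cells = [];
  var cell = "";
  var inQuotes = false;
  for (var i = 0; i < line.length; i++) {
    var ch = line.charAt(i);
    if (ch === '"') {
      inQuotes = !inQuotes;
    } else if (ch === "," && !inQuotes) {
      cells.push(cell);
      cell = "";
    } else {
      cell += ch;
    }
  }
  cells.push(cell);
  return cells;
}

// Loads CSV text into an array of plain objects keyed by the header row.
function loadCsv(text) {
  var lines = text.split("\n");
  var header = parseCsvLine(lines[0]);
  return lines.slice(1).map(function (line) {
    var cells = parseCsvLine(line);
    var record = {};
    header.forEach(function (name, column) {
      record[name] = cells[column];
    });
    return record;
  });
}

function loadJson(text) {
  return JSON.parse(text);
}

// Returns a list of human-readable differences; an empty list means "equal".
function diffModels(a, b) {
  var differences = [];
  if (a.length !== b.length) {
    differences.push("record count: " + a.length + " vs " + b.length);
  }
  var shared = Math.min(a.length, b.length);
  for (var i = 0; i < shared; i++) {
    var seen = Object.create(null);
    Object.keys(a[i]).concat(Object.keys(b[i])).forEach(function (name) {
      if (seen[name]) { return; }
      seen[name] = true;
      if (a[i][name] !== b[i][name]) {
        differences.push("record " + i + ", " + name + ": " + a[i][name] + " vs " + b[i][name]);
      }
    });
  }
  return differences;
}

var csvText = [
  "Title,Author,Publication Date,Price",
  'Harry Potter and the Goblet of Fire,J.K. Rowling,"July 8, 2000",18.89',
  'The Future of Ideas,Lawrence Lessig,"October 22, 2002",9.75',
  'The End of Poverty,Jeffrey Sachs,"February 28, 2006",10.40'
].join("\n");

var jsonText = JSON.stringify([
  { "Title": "Harry Potter and the Goblet of Fire", "Author": "J.K. Rowling", "Publication Date": "July 8, 2000", "Price": "18.89" },
  { "Title": "The Future of Ideas", "Author": "Lawrence Lessig", "Publication Date": "October 22, 2002", "Price": "9.75" },
  { "Title": "The End of Poverty", "Author": "Jeffrey Sachs", "Publication Date": "February 28, 2006", "Price": "10.40" }
]);

var modelDifferences = diffModels(loadCsv(csvText), loadJson(jsonText));
```

The prices are stored as strings in the JSON on purpose: a JSON number 10.40 would be read back as 10.4 and no longer match the CSV text "10.40", so a real test would need to decide how types are normalized before diffing.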
Idea 2: multi-valued attributes
Just like idea 1, but including a few books that have more than one author.
Idea 3: references
Like idea 2, but with a separate authors table, and references from the book records to the author records.
...to-do: add more here...
Idea 4: data transfer
Start with data from a data source accessible by one kind of in-memory implementation.
Transfer the data to a data source accessible by a different kind of in-memory implementation.
...to-do: add more here...
adamsz said, 07/08/2006
re: class/kind/type: i think we should replace this with class and datatype -- anyone with any programming experience will know what these mean, as opposed to these terms, which are ambiguous. HTML's "class" attribute isn't really involved in this discussion so i don't see how that creates confusion. If you do need to refer to it, saying "HTML class attribute" is clear enough.