The W3C Voice Browser Working Group has published the second working draft of the VoiceXML 3.0 specification. VoiceXML is used to describe those annoying call trees you hear when calling most major companies. "Press 1 if you want to wait on hold for 20 minutes and then be hung up on; press 2 if you want to wait indefinitely; press 3 if you'd rather we just hung up on you now."
How does one build a successor to VoiceXML 2.0/2.1? Requests for improvements to VoiceXML fell into two main categories: extensibility and new functionality.
To accommodate both, the Voice Browser Working Group
- Developed the detailed semantic descriptions of VoiceXML functionality that versions 2.0 and 2.1 lacked. The semantic descriptions clarify the meaning of the VoiceXML 2.0 and 2.1 functionalities and how they relate to each other. The semantic descriptions are represented in this document as English text, UML state chart visual diagrams [ref] and/or textual SCXML representations [ref]. Figure 1 illusrates the VoiceXML 3.0 framework which contains some abstract UML state chart visual diagrams representing some existing VoiceXML functionality.
- Described the detailed semantics for new functionality. New functions include, for example, speaker identification and verification, video capture and replay, and a more powerful prompt queue. These semantic descriptions for these new functions are also represented in this document as English text, UML state chart visual diagrams [ref] and/or textual SCXML representations [ref]. Figure 2 contains some abstract UML state chart visual diagrams representing new functionality.
- Organized the functionality into modules, with each module implementing different functions. One reason for the introduction of a more rigorous semantic definition is that it allows us to assign semantics to individual modules. This makes it easier to understand what happens when modules are combined or new ones are defined. In contrast, VoiceXML 2.0 and 2.1 had a single global semantic definition (the FIA), which made it difficult to understand what would happen if certain elements were removed from the language or if new ones were added. Figure 3 contains some modules, each containing VoiceXML 3.0 functionality Vendors may extend VoiceXML functionality by creating additional modules with additional functionality not described in this document. For example, a vendor might create a new GPS input module. Application developers should be cautious about using vendor-specific modules because the resulting application may not be portable.
- Restructured and revisedDefined the syntax of each module to incorporate any new functionality. Application developers use the syntax of each module as an API to invoke the module’s functions. Figure 4 illustrates some simplified syntax associated with modules.
- Introduced the concept of a profile (language) which incorporates the syntax of several modules. Figure 5 illustrates two profiles. For example, a VoiceXML 2.1 profile incorporates the syntax of most of the modules corresponding to the VoiceXML 2.1 functionality which will support most existing VoiceXML 2.1 applications. Thus most VoiceXML 2.1 applications can be easily ported to VoiceXML 3.0 using the VoiceXML 2.1 profile. Another profile omits the VoiceXML 2.1 Form Interpretation Algorithm (FIA). This profile may be used by developers who want to define their one own flow control rather than using the FIA. Profiles enable platform developers to select just the functionality that application developers need for a platform or class of application. Multiple profiles enables developers to use just the profile (language) needed for a platform or class of applications. For example, a lean profile for portable devices, or a full-function profile for servers-based applications using all of the new functionality of VoiceXML 3.0.
One of the benefits of detailed semantic descriptions is improving portability within VoiceXML. Two vendors may implement the same functionality differently; however, the functionality must be consistent with the semantic meanings described in this document so that application authors are isolated from the different implementations. This increases portable among platforms that support the same syntax. Note that there are many other factors that effect to the portability that is outside the scope of this document (e.g. speech recognition capabilities, telephony).