Department of Computer Science
Drexel University
This work was supported by NSF under Contract ANI-0133537.
We have developed the ``Ethernet Speaker'' (ES), a network-enabled single board computer embedded in a conventional audio speaker. Audio streams are transmitted over the LAN in multicast packets, and the ES can select any one of them and play it back. A key requirement for the ES is that it must be capable of playing any type of audio stream, independent of the streaming protocol or the encoding used. We achieved this by providing a streaming audio server built around a kernel-based audio stream redirector (ASR) in the OpenBSD kernel. The ASR accepts input from any of the existing audio file players or streaming audio clients. Since all audio applications must use the system audio driver, our system can accommodate any protocol or file format, provided there exists a compatible player running under OpenBSD. In this paper we discuss the design and implementation of the server as an ASR, the streaming protocol developed for this application, and the implementation of the client.
Keywords: Virtual drivers, OpenBSD, audio, multicast.
Audio hardware is becoming increasingly common on desktop computers. With the advent of easy-to-install and cost-effective networking technologies such as Ethernet and, more recently, Wireless Local Area Networks, there has been increasing interest in allowing audio sources to be shared by multiple devices. A central server can stream audio data to a number of clients on the same network. The clients need only capture the streamed data and play it on their audio devices (Figure 1). This concept works in the home as well as on a campus or office LAN, where it can serve as the basis for wireless surround-audio systems, background music, public announcements, etc.
However, the popularity of audio services on the Internet has resulted in the deployment of a large number of protocols and data formats, producing an on-line Babel of mismatched applications and protocols. Moreover, most of the popular streaming services have adopted incompatible and proprietary formats. It is particularly noteworthy that even though RealNetworks released the source code for the Helix Client, they did not provide source for the RealAudio G2 and RealAudio 8 decoders (Helix, 2002). In addition to the streaming audio formats, there is also a large number of encoding schemes for file-based audio players (MP3, wav, and so on). To cope with all these formats, users have to run multiple audio playback applications, each with its own user interface and peculiarities. Moreover, most commercially available streaming servers cannot stream audio files in their native formats. These files need to be ``hinted'' with metadata that aids the streamer in generating Real-time Transport Protocol (RTP) packets (Apple, 1998). This mandates that the audio files undergo a format conversion before they become eligible for streaming.
Our goal was to produce an audio server that is independent of the format of the audio source. We also endeavoured to create a single client able to accept music from any of the above players without the need for specialised plug-ins or updates. In the next section we discuss the issues surrounding the design of the audio client and the streaming server. We then describe our prototype implementation, which redirects the output of off-the-shelf audio player applications using a kernel-based redirector. Finally, we discuss how our work compares with related projects and systems, and our plans for future enhancements.
A typical audio device (Lowe et al, 1998), as shown in Figure 2, consists of two logically independent systems, one for recording and another for playback. In the recording section, analog input is first passed through a mixer before being fed to an Analog to Digital Converter (ADC), whose output is subsequently stored in a buffer. The audio hardware dictates various configuration parameters such as resolution (8 or 16 bits), data format (linear or μ-law), channels (mono/stereo), and the sampling rate.
The playback section makes use of a similar buffer to hold data that is to be passed to the Digital to Analog Converter (DAC). The output of the DAC is routed through a mixer before being sent to the speakers. The operating system also provides format conversions to supplement the native features supplied by the hardware.
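On OpenBSD, these parameters are exposed to user programs through the audio_info structure defined in sys/audioio.h. As a minimal sketch (the parameter values shown are illustrative, not prescribed by the paper), a playback application might configure an already-opened audio device as follows:

    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <sys/audioio.h>

    /* Configure the playback section of an opened audio device;
     * values are illustrative. */
    int
    configure_audio(int fd)
    {
            struct audio_info info;

            AUDIO_INITINFO(&info);          /* mark all fields "unchanged" */
            info.play.sample_rate = 44100;  /* samples per second */
            info.play.precision   = 16;     /* bits per sample */
            info.play.channels    = 2;      /* stereo */
            info.play.encoding    = AUDIO_ENCODING_SLINEAR_LE;
            return ioctl(fd, AUDIO_SETINFO, &info);
    }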
An audio playback application such as mpg321(1) works by continuously reading blocks of audio information from the source (a file or a network connection), performing some processing, and sending the result to the audio device for playback. This is done by writing blocks of data to the audio special device (/dev/audio). The data is placed in the playback buffer, which the device driver further divides into smaller blocks before passing the data to the audio hardware. I/O control (ioctl(2)) calls are used by applications to query the device driver about the capabilities of the underlying hardware, or to set various parameters of the audio device (e.g. the sampling rate).
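The core of such a player reduces to a simple copy loop. A minimal sketch (error handling elided; the block size is an arbitrary choice, and a real player would decode each block before writing it out):

    #include <fcntl.h>
    #include <unistd.h>

    /* Copy a (pre-decoded) audio file to the audio device. */
    void
    play_file(const char *path)
    {
            char buf[8192];
            ssize_t n;
            int src = open(path, O_RDONLY);
            int audio = open("/dev/audio", O_WRONLY);

            while ((n = read(src, buf, sizeof(buf))) > 0)
                    write(audio, buf, n);   /* blocks until buffer space frees */

            close(src);
            close(audio);
    }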
The audio device driver in the OpenBSD 3.1 kernel is divided into a high-level, hardware-independent layer and a low-level, hardware-dependent layer. The high-level driver interacts with user-level applications and provides a uniform application programming interface to the underlying hardware-dependent driver modules.
Low-level drivers provide a set of function calls (NetBSD) which allow the hardware to be integrated into the runtime environment of the operating system. A device driver does not need to implement all the calls, but at a minimum a useful driver needs to support opening and closing the device, device control and configuration, and I/O calls.
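In the 4.4BSD-derived audio framework these entry points are collected in a function table that each low-level driver hands to the hardware-independent layer. An abridged sketch of that table, following the NetBSD/OpenBSD audio_hw_if interface (the real structure has more members than shown here):

    /* Abridged function table a low-level audio driver registers
     * with the machine-independent layer. */
    struct audio_hw_if {
            int  (*open)(void *, int);        /* prepare the hardware */
            void (*close)(void *);            /* release the hardware */
            int  (*set_params)(void *, int, int,
                      struct audio_params *, struct audio_params *);
            int  (*start_output)(void *, void *, int,
                      void (*)(void *), void *);  /* begin playback of a block */
            int  (*halt_output)(void *);      /* stop playback */
            /* ... plus control, configuration and recording entry points ... */
    };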
Our technique involves intercepting the audio stream as it flows through the upper half of the audio driver and sending the audio information to the network (grey box in Figure 2).
The Audio Stream Redirector
Instead of writing a user-level application that implements streaming protocols (e.g. RTP) and CODECs for complex encodings such as MP3 (MPEG Layer 3), we have added a new audio stream redirector (ASR) to the OpenBSD operating system. The ASR packetizes the audio play buffers and multicasts them onto the network.
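Conceptually, the hook amounts to a couple of lines in the hardware-independent write path, inserted after the user data has been copied into the play buffer. A sketch (asr_output() is a hypothetical name for the redirector's entry point, not an existing kernel function):

    /* Inside the upper-half write routine of the audio driver: */
    error = uiomove(blk->addr, blk->len, uio);   /* copy user data into play buffer */
    if (error == 0)
            asr_output(blk->addr, blk->len);     /* hand a copy to the ASR for
                                                  * packetization and multicast */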
The presence of the ASR should not be detectable by the application: an audio subsystem with an ASR should be indistinguishable from a real audio device. This is achieved because the ASR does not affect the normal operation of the audio driver, so the application remains entirely unaware of the redirection of the audio information to the network (Figure 3).
By positioning the audio redirector in the kernel, we provide an audio streaming facility that does not depend on the audio application or the encoding format. Moreover, unlike the Darwin Streaming Server (http://developer.apple.com/darwin/projects/streaming), the music file does not need to be in a special ``hinted'' format. This enables the ASR-based server to accommodate new audio encoding formats as long as a player has been ported to OpenBSD.
The redirected audio stream is bound to a specific multicast address (and port). Once the program opens the audio device and starts writing to it, the ASR will begin sending packets to the network.
Our client, which we call the ``Ethernet Speaker'' (ES), is a rather simple device, comprising a single board computer (SBC) running an embedded version of OpenBSD, a network connection (preferably wireless), an audio card, and a pair of speakers. The ES receives music from the network and plays it on its speakers. Since Drexel has a campus-wide network, the ES can receive music anywhere on our campus.
Being an embedded device, the ES has limited resources, so it was considered important to standardise the format of the incoming streams. Another constraint is synchronisation: if more than one ES is playing the same channel within earshot, any delay between the outputs creates cacophony. Finally, since we expect a large number of ESs to be deployed on our campus, we had to minimise their impact on network resources.
These three constraints made the use of an off-the-shelf player infeasible. We briefly experimented with the Real Time Audio Tool (Kouvelas, 1997) but it did not allow us to convert, for example, Real Audio streams into something the ES could play.
In the end we decided to use multicast packets for the streaming audio, with one address per channel. This achieves synchronisation between playing ESs and reduces the load on the network. We then redirect the output of existing player applications into the network, so the players need not run on the ESs themselves but on one or more audio servers. Removing the players also lets us standardise the user interface of the ESs. Finally, we redirect the packets received at the ESs to their corresponding audio devices without any post-processing. This reduces the processing power required at the ESs and achieves format standardisation of the incoming streams. To avoid burdening the server, we also require that the ES select which channel to play without having to make arrangements with the server.
The communication protocol between the audio streaming server and the ESs has to be lightweight and easy to implement. On the server side the entire protocol runs in kernel space, so a simple design reduces debugging effort (and kernel reboots). A simple protocol also imposes a smaller burden on client-side resources, allowing lightweight clients to be used. Given the real-time nature of the transmissions and the continuous nature of the audio stream, there is no need for retransmissions or error correction: incorrect packets are simply discarded. This is acceptable in our target environment which, being a LAN, exhibits low packet loss rates.
Clients may join a transmission at any time and are able to recover from lost packets or transient network problems. This allows the server to remain oblivious to the number and type of clients operating in the network. Finally, we use multicast packets to reduce network load and to synchronise the clients.
When a player application accesses the audio device (via the open(2) call), data flows to the network via the ASR. This is done by having the ASR open a kernel UDP socket, with the destination address in the socket's address structure set to a predefined multicast address and destination port.
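A hedged sketch of this setup, using the BSD in-kernel socket interface (exact signatures vary between releases; the group address and port shown are placeholders of our own choosing):

    #include <sys/param.h>
    #include <sys/socket.h>
    #include <sys/socketvar.h>
    #include <netinet/in.h>

    static struct socket *asr_so;        /* kernel UDP socket */
    static struct sockaddr_in asr_dst;   /* multicast destination */

    /* Called when a player opens the audio device. */
    int
    asr_open(void)
    {
            int error;

            /* create a UDP socket entirely inside the kernel */
            error = socreate(AF_INET, &asr_so, SOCK_DGRAM, IPPROTO_UDP);
            if (error)
                    return (error);

            bzero(&asr_dst, sizeof(asr_dst));
            asr_dst.sin_len    = sizeof(asr_dst);
            asr_dst.sin_family = AF_INET;
            asr_dst.sin_port   = htons(5555);            /* placeholder port */
            asr_dst.sin_addr.s_addr = htonl(0xe0000114); /* 224.0.1.20, placeholder */
            return (0);
    }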
Every time data is written into the play buffer by a user-level application, it is packetized and sent to the network in the following manner: a special control packet, containing the information needed to configure the audio device on the client side, precedes the transmission of the actual data. The contents of the play buffer are then sent as one or more 1024-byte packets. The socket is finally closed when the user process terminates its connection with the audio driver.
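The paper does not fix a wire format, so the following layout is purely illustrative; it shows the kind of information the control packet has to carry, and the per-write packetization loop. Here asr_send() stands for a hypothetical kernel routine that pushes one UDP packet out the socket created above:

    #define ASR_DATA_LEN 1024          /* payload size of a data packet */
    #define ASR_MAGIC    0x41535231    /* marks control packets; illustrative */

    /* Illustrative control packet: enough for a client to configure
     * its audio device before interpreting the data packets. */
    struct asr_ctrl {
            u_int32_t magic;           /* ASR_MAGIC */
            u_int32_t sample_rate;     /* e.g. 44100 */
            u_int32_t precision;       /* bits per sample */
            u_int32_t channels;        /* 1 = mono, 2 = stereo */
            u_int32_t encoding;        /* AUDIO_ENCODING_* value */
    };

    /* For every write into the play buffer: */
    asr_send(&ctrl, sizeof(ctrl));     /* control packet precedes the data */
    for (off = 0; off < len; off += n) {
            n = MIN(len - off, ASR_DATA_LEN);
            asr_send(buf + off, n);    /* play buffer as 1024-byte packets */
    }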
A key decision was whether to send the audio stream directly to the network, or to pass it to a user-level process which would then decide what to do with it. Although the latter technique is more flexible and allows better handling of the audio stream, we decided against it as it is particularly wasteful of resources. Since the main application of the ASR is the redirection of audio streams to the network, directing the stream through a user-level process that then transmits it to the network would involve superfluous data copying and context switching. We therefore decided to inject the audio byte stream directly into the network as a series of multicast packets. This allows the ASR to construct the packets and forward them to the networking code entirely within the kernel.
Synchronising the audio output of two devices is not as simple as ensuring that we send the audio data to both devices at the same time. Variations in the routing topology (e.g. a different number of hops between the server and the various receivers) and operating system activity (e.g. interrupt processing, scheduling of the client process) may result in loss of synchronisation. The human ear is sensitive to the resulting phase differences, so we were careful to create an operational environment where such conditions are unlikely to occur. We have standardised the configuration of the clients (hardware, operating system, applications), and our network topology follows a logical design where machines in the same area are normally connected to the same Ethernet segment. The only case where this is violated is where we have wireless clients next to a client connected to the wired LAN.
The client must select a multicast stream and then start playing the audio. The design of the protocol is such that the client never contacts the server. We achieve this by interspersing control packets between the normal data packets. The control packets contain sufficient information (sampling rate, bit precision, encoding, channels, etc.) for the client to configure its own audio hardware so that it can process the data sent by the server. The data in the ordinary packets is sent directly to the audio device via the standard interface (/dev/audio).
Multimedia streaming servers make use of the RTP control protocol, RTCP (RFC 3550), to obtain feedback on the quality of data distribution. The protocol consists of periodically transmitting control packets (sender and receiver reports) to all participants in the session, using the same distribution mechanism as the data packets. This feedback may be directly useful to the server for controlling adaptive encoding schemes, or may alternatively be used to diagnose faults in the distribution. Sending reception feedback reports to all participants helps in analysing whether problems are local or global. The ASR-based server, however, transmits the audio data in its raw (uncompressed) form over a relatively benign LAN environment where network resources are plentiful. This does not justify incurring the extra overhead that feedback entails, given that the probability of congestion or other network problems in a LAN is low. Moreover, the feedback traffic would impose a burden on the server. Clients can join or leave the multicast groups at any time without notifying the server. The server remains oblivious to the number of clients ``tuned in,'' and its performance does not degrade as the number of clients increases. This makes the application very scalable: it can be used across campus or office networks.
We have constructed a sample client that receives packets from the network and plays them back via the local audio device. The client configures its audio hardware using the control packets that are sent periodically; in this way the client can start playing an audio stream by simply waiting for the first control packet. To differentiate between audio streams with varying parameters, the client compares the configuration parameters in every control packet to the parameters previously used to set up the audio device. If there is a mismatch, it invokes the configuration setup routine before sending data to the device. Since the client receives raw data from the network in the native encoding of the audio driver, there is no decompression or decoding overhead, which minimises the load on the client. The client is a simple program designed to run on a variety of platforms (our current implementation runs under OpenBSD 3.4 and RedHat Linux 7.3). The low computing requirements make future ports of the client to handheld devices a viable option.
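A condensed sketch of the client's main loop, reusing the illustrative asr_ctrl layout, ASR_MAGIC constant, and placeholder group/port from the server sketches above (error handling elided):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/ioctl.h>
    #include <sys/audioio.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>

    int
    main(void)
    {
            struct sockaddr_in sin;
            struct ip_mreq mreq;
            struct asr_ctrl cur, *ctl;   /* asr_ctrl as sketched earlier */
            char pkt[1024];
            ssize_t n;
            int s, audio;

            s = socket(AF_INET, SOCK_DGRAM, 0);
            memset(&sin, 0, sizeof(sin));
            sin.sin_family = AF_INET;
            sin.sin_port = htons(5555);                /* placeholder port */
            sin.sin_addr.s_addr = htonl(INADDR_ANY);
            bind(s, (struct sockaddr *)&sin, sizeof(sin));

            /* join the channel's multicast group (placeholder address) */
            mreq.imr_multiaddr.s_addr = inet_addr("224.0.1.20");
            mreq.imr_interface.s_addr = htonl(INADDR_ANY);
            setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

            audio = open("/dev/audio", O_WRONLY);
            memset(&cur, 0, sizeof(cur));

            for (;;) {
                    n = recv(s, pkt, sizeof(pkt), 0);
                    if (n <= 0)
                            continue;                  /* bad packets: discard */
                    ctl = (struct asr_ctrl *)pkt;
                    if (n == sizeof(*ctl) && ctl->magic == ASR_MAGIC) {
                            /* reconfigure only if the parameters changed */
                            if (memcmp(ctl, &cur, sizeof(cur)) != 0) {
                                    struct audio_info info;
                                    AUDIO_INITINFO(&info);
                                    info.play.sample_rate = ctl->sample_rate;
                                    info.play.precision   = ctl->precision;
                                    info.play.channels    = ctl->channels;
                                    info.play.encoding    = ctl->encoding;
                                    ioctl(audio, AUDIO_SETINFO, &info);
                                    cur = *ctl;
                            }
                    } else {
                            write(audio, pkt, n);      /* raw samples: play */
                    }
            }
    }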
The client application runs on the Ethernet Speaker, an embedded platform for audio playback over the LAN. The ES is a single board computer with a Pentium-class CPU and 64 MB of RAM. The computer also includes an 8 MB flash memory as a boot medium, a PCMCIA slot for the network interface, and an audio card. The device boots a kernel that contains a RAM disk with the root partition, and downloads its configuration from the network.
The design of the ES borrows heavily from the embedded VPN gateway project (Prevelakis et al, 2002), which used a similar embedded design for a combined firewall and Virtual Private Network gateway.
Channel selection is currently handled by a number of pushbuttons on the front panel of the machine. The interface attempts to mimic the preset buttons on car radios: the ES monitors a number of ``well known'' multicast addresses for audio information and identifies the pushbuttons that are associated with active audio streams. The user presses one of these ``active'' buttons and the ES begins playing the corresponding audio stream. This interface is rather limited and we plan to augment it by advertising available programs on a separate multicast channel. This information can include the name of the song being played and the multicast address of the channel. The user will then be able to select the desired audio track by touching the name of the channel on a touch-sensitive screen.
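The scan behind the ``active'' buttons can be as simple as joining each well-known group and listening briefly for traffic; a minimal sketch (the one-second window is an arbitrary choice):

    #include <sys/select.h>
    #include <sys/time.h>

    /* Probe a socket already joined to a channel's multicast group:
     * returns 1 if a packet arrives within the timeout, 0 otherwise.
     * Used to decide which preset buttons to mark as "active". */
    int
    channel_active(int s)
    {
            fd_set rfds;
            struct timeval tv = { 1, 0 };   /* 1-second listening window */

            FD_ZERO(&rfds);
            FD_SET(s, &rfds);
            return select(s + 1, &rfds, NULL, NULL, &tv) > 0;
    }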
SHOUTcast (http://www.shoutcast.com) is an MPEG Layer 3 based streaming server technology. It permits anyone to broadcast audio content from their PC to listeners across the Internet or any other IP-based network. It is capable of streaming live audio as well as on-demand archived broadcasts. Listeners tune in to SHOUTcast broadcasts by using a player capable of streaming MP3 audio (e.g. Winamp for Windows, XMMS for Linux). Broadcasters use Winamp along with a special plugin called SHOUTcast Source to redirect Winamp's output to the SHOUTcast server. The streaming is done by the SHOUTcast Distributed Network Audio Server (DNAS). All MP3 files inside the content folder are streamable, and the server can maintain a web interface that lets users selectively play its streamable content.
Since the broadcasters need to use Winamp with the SHOUTcast Source plugin, they are limited to the formats supported by Winamp. It is also not clear whether the system supports multiple concurrent Winamp server sessions on the same machine. Moreover, the server is designed to support outbound audio transmissions (from the LAN to the outside) and thus does not support multicasting. Finally, the server is tied to the Windows platform.
The Helix Universal Server from RealNetworks is a universal platform server with support for live and on-demand delivery of all major file formats, including RealMedia, Windows Media, QuickTime, MPEG-4, MP3 and more. It is both scalable and bandwidth-conserving, as it comes integrated with a content networking system specifically designed to provision live and on-demand content. It also includes server fail-over capabilities which route client requests to backup servers in the event of failures or unexpected outages.
An application similar to the ASR is the Multicast File Transfer Protocol (MFTP) from StarBurst Communications. MFTP is designed to provide efficient and reliable file delivery from a single sender to multiple receivers.
The concern that messages sent by clients participating in a multicast can flood the server is mentioned in the Multicast FTP draft RFC (Miller et al, 1998). Under the MFTP protocol, after a file is multicast, clients contact the server to get missing or corrupted blocks of the file. MFTP aggregates these requests (NAKs) from each recipient, so that one NAK can represent multiple bad or dropped packets. The ASR design, by contrast, accepts that a small number of discarded packets is tolerable for audio transmissions, and simply allows the clients to ignore bad or lost packets.

MFTP also uses a separate multicast group to announce the availability of data sets on other multicast groups. This gives the clients a chance to choose whether to participate in an MFTP transfer. This is a very interesting idea, in that the client does not need to listen in on channels that are of no interest to it. We plan to adopt this approach in the next release of our streaming audio server for the announcement of information about the audio streams being transmitted on the network. In this way the user can see which programs are being multicast, rather than having to switch channels to monitor the audio transmissions. Another benefit of this out-of-band catalogue is that it enables the server to suspend transmission of a particular channel if it notices that there are no listeners. This notification may be handled through the use of the proposed MSNIP standard (Fenner et al, 2002). MSNIP allows the audio server to ask the first-hop routers whether there are listeners on the other side, so the server can obtain overall status information without running the risk of the ``NAK implosion'' problem mentioned earlier. Unfortunately, we have to wait until MSNIP appears in the software distributions running on our campus routers.
Our motivation for this work was our view that existing audio players offer a poor compromise between interoperability and reliability. In a LAN environment there is no need for elaborate mechanisms for adapting to network problems. On the other hand, the need for multiple connections to remote audio servers increases the load on the external connection points of the network and the work that has to be performed by firewalls, routers, and so on. Finally, the large number of special-purpose audio players (each compatible with a different subset of the available formats) alienates users and creates support and administrative headaches. By implementing the audio streaming server as an extension of the audio driver on the system running the decoding applications, we have bypassed the compatibility issues that haunt any general-purpose audio player. Our system confines the special-purpose audio players to a few servers that multicast the audio data, always using the same common format.
The existence of a single internal protocol, without special cases or the need for additional development to support new formats, allowed the creation of the ``Ethernet Speaker,'' an embedded device that plays audio streams received from the network. The communications protocol allows any client to tune in to, or out of, a transmission without requiring the knowledge or co-operation of the server. The protocol also provides synchronisation between clients playing the same audio stream.