FPGA-based Audio-over-IP

Ethernet on microcontrollers is easy to get running: FreeRTOS or the Arduino libraries are always a good start. But on FPGAs? I was shocked how expensive Ethernet is for FPGAs - but I found a good solution...

In 1997 I bought my first NE2000-compatible ISA Ethernet card for my Windows 95 computer. Why? Well, not because my 13-year-old self had any need for it, but because my cousin's computer showed a green network symbol in the device manager, and so I wanted such a device, too. The installation was easy: plug in the card, and Windows 95 detected it and installed the drivers for me. We then ran a BNC network cable between our two houses and played StarCraft and Half-Life together. That was almost 30 years ago.

Since then I have written a lot of software that uses TCP or UDP connections, but every time on a Windows- or Linux-based computer, or on a microcontroller with ICs like the SPI-based ENC28J60, which contains an Ethernet MAC and PHY. The Media Access Control (MAC) is the part that is close to your software and talks to a connected PHY. The PHY (physical layer) drives the line through the magnetics and does the low-level signalling on the wire. Data usually moves between MAC and PHY over an MII-style interface (MII, RMII, GMII or RGMII), while the PHY is configured via the MDIO management interface.

Why talk about Ethernet in 2025?

OK, but why am I writing this in 2025 when I could buy an Ethernet card back in 1997? Well, a couple of weeks ago I started to work with Ethernet in my FPGA designs. As I'm working with Intel/Altera FPGAs in Quartus Prime Lite, I first used the Intel TSE, a Triple-Speed Ethernet MAC with pretty good documentation. It took me only a couple of hours to implement some working logic that communicates with the Marvell Ethernet PHY on an old eval board with a Cyclone III. But when loading the logic into the FPGA, a timer started, showing me that I only had a grace period to test my design - afterwards I would have to pay.

That's basically OK, but searching around the internet I found that this Intel TSE IP package costs well above 1000€ - wow. Searching for alternatives, I realized that Intel's competitors demand high prices as well, some between 5000 and 8000€ for a single license. This is out of reach for DIY projects.

A solution is in sight

During my search I stumbled across a promising GitHub repository: in 2015, Philipp Kerling published his bachelor thesis including well-documented VHDL code for his 1 Gbps Ethernet MAC. Originally this Ethernet MAC was designed for a Xilinx Spartan 6, but with minor changes the logic compiled in my Quartus Prime Lite. Hurray, I had found an open-source solution! But unlike commercial products, Philipp's Ethernet MAC offers "only" the most important parts needed to initialize an Ethernet PHY. All high-level functions were missing:

[Figure] Source: Bachelor thesis of Philipp Kerling

Putting everything together, it took only a short while until my Marvell 881119 Ethernet PHY showed some activity on its LEDs. But sending real data was still missing...

So, what do we really need for Audio-over-IP?

Thinking about my original goal, transmitting audio data over Ethernet, I thought about the protocols I really need to implement. My main goal was to send UDP packets containing low-latency audio samples to a destination computer. But Windows will not accept UDP streams without a proper ARP implementation in the sender. So next to UDP I had to implement ARP. Then it would be great to PING the sender to check the connection, so next to UDP and ARP I needed ICMP - and everything in pure logic, as I did not want to implement a soft-core microcontroller or an additional real processor.

For my first tests I added a static ARP entry (a "neighbor") on my computer, so ARP and ICMP were not the most urgent things:

# add a static ARP entry (neighbor) under Windows
netsh interface ipv4 add neighbors "Name of Ethernet Conn" IpAddressOfDest MacAddressOfDest
# the entry can be removed again using
arp -d IpAddressOfDest

OK, but still I had to implement Ethernet-Packets and UDP from scratch.

Ethernet- and UDP-Packets

The idea was to send a new Ethernet-packet to the destination IP- and MAC-Address when receiving one or more audio-samples. So first I had to implement a state-machine for the most important communication:


architecture Behavioral of udp_packet is
begin
    process (tx_clk)
    begin
        if (falling_edge(tx_clk)) then
            if ((frame_start = '1') and (zframe_start = '0') and (s_SM_Ethernet = s_Idle)) then
                -- prepare begin of packet
            elsif (s_SM_Ethernet = s_CalcChecksum) then
                -- calculate checksum for IP-Header
            elsif (s_SM_Ethernet = s_Start) then
                -- wait until MAC is ready again
            elsif (s_SM_Ethernet = s_Transmit) then
                -- wait until previous byte is sent
                -- then send next byte until EndOfPacket
            elsif (s_SM_Ethernet = s_End) then
                -- disable transmitter
            end if;
        end if;
    end process;
end Behavioral;

For sending the individual bytes I implemented a byte counter within the s_Transmit state. Another thing was quite challenging: calculating the checksums in time, because the IP and UDP checksums are not at the end of the packet but near the beginning, inside the IP and UDP headers. (The CRC32 frame check sequence at the very end of the Ethernet frame is a separate checksum.) But one step at a time.

The following data has to be in place to send a new ethernet-packet:

MAC-Header (14 bytes long)
IP-Header (20 bytes long)
UDP-Header (8 bytes long)
UDP-Payload (depends on data)
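
To make the byte offsets in the following listings easier to follow, here is a minimal sketch in Python (used only as a modelling language on the PC side, it is not part of the FPGA project). The constant names mirror the ones in the VHDL code; the helper `frame_layout` is a hypothetical name chosen for this illustration.

```python
# Header sizes in bytes, as listed above.
MAC_HEADER_LENGTH = 14
IP_HEADER_LENGTH = 20
UDP_HEADER_LENGTH = 8


def frame_layout(udp_payload_length):
    """Return the start offset of each section and the total frame length."""
    ip_start = MAC_HEADER_LENGTH
    udp_start = ip_start + IP_HEADER_LENGTH
    payload_start = udp_start + UDP_HEADER_LENGTH
    return {
        "mac": 0,
        "ip": ip_start,
        "udp": udp_start,
        "payload": payload_start,
        "total": payload_start + udp_payload_length,
    }


if __name__ == "__main__":
    # 18 bytes is the size of the "Hello World" test payload used below.
    print(frame_layout(18))
```

For the 18-byte test payload this gives the offsets 14, 34 and 42 used as array indices in the VHDL listings, and a total of 60 bytes (without preamble and frame check sequence).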

The MAC-Header is quite static and contains information about the source-MAC-Address, destination-MAC-Address and the used protocol, here: "IP":


-- MAC HEADER (14 bytes)
-- fill MAC-Header with desired values
udp_frame(0) <= dst_mac_address(47 downto 40); -- MSB contains typical left side of MAC
udp_frame(1) <= dst_mac_address(39 downto 32);
udp_frame(2) <= dst_mac_address(31 downto 24);
udp_frame(3) <= dst_mac_address(23 downto 16);
udp_frame(4) <= dst_mac_address(15 downto 8);
udp_frame(5) <= dst_mac_address(7 downto 0);

udp_frame(6) <= src_mac_address(47 downto 40); -- MSB contains typical left side of MAC
udp_frame(7) <= src_mac_address(39 downto 32);
udp_frame(8) <= src_mac_address(31 downto 24);
udp_frame(9) <= src_mac_address(23 downto 16);
udp_frame(10) <= src_mac_address(15 downto 8);
udp_frame(11) <= src_mac_address(7 downto 0);

-- IP Protocol
udp_frame(12) <= x"08"; -- type [0x0800 = IP Protocol]
udp_frame(13) <= x"00";

Within the IP-Header there is a bit more life: it holds the total length of the IP packet (IP-header, UDP-header and payload) as well as the IP-addresses of source and destination:


-- IP HEADER (20 bytes)
udp_frame(14) <= x"45"; -- b14 = version (4-bit) | internet header length (4-bit) [Version 4 and header length of 0x05 = 20 bytes]
udp_frame(15) <= x"00"; -- differentiated services (6-bits) | explicit congestion notification (2-bits)
udp_frame(16) <= std_logic_vector(to_unsigned(IP_HEADER_LENGTH + UDP_HEADER_LENGTH + UDP_PAYLOAD_LENGTH, 16)(15 downto 8)); -- total length without MAC-header: IP-header + UDP-header + payload in bytes (maximum 65,535). Ethernet needs at least 46 bytes here, shorter packets have to be padded
udp_frame(17) <= std_logic_vector(to_unsigned(IP_HEADER_LENGTH + UDP_HEADER_LENGTH + UDP_PAYLOAD_LENGTH, 16)(7 downto 0)); -- 20 bytes IP-header + 8 bytes UDP-header + 18 bytes UDP-payload = 46 bytes = 0x002e
udp_frame(18) <= std_logic_vector(to_unsigned(packet_counter, 16))(15 downto 8); -- identification (primarily used for uniquely identifying the group of fragments of a single IP datagram) [0x0000 will be ignored by windows, so we set the packet_counter to this value in the next step]
udp_frame(19) <= std_logic_vector(to_unsigned(packet_counter, 16))(7 downto 0);
udp_frame(20) <= x"00"; -- flags (3-bits) | fragment offsets (13-bits)
udp_frame(21) <= x"00";
udp_frame(22) <= x"80"; -- time to live (0x80 = 128)
udp_frame(23) <= x"11"; -- b23 = protocol (0x01 = ICMP, 0x06 = TCP, 0x11 = UDP)
udp_frame(24) <= x"00"; -- header checksum (16-bit ones' complement of the ones' complement sum of all 16-bit words in the header)
udp_frame(25) <= x"00";

udp_frame(26) <= src_ip_address(31 downto 24); -- MSB contains typical "192"
udp_frame(27) <= src_ip_address(23 downto 16);
udp_frame(28) <= src_ip_address(15 downto 8);
udp_frame(29) <= src_ip_address(7 downto 0);

udp_frame(30) <= dst_ip_address(31 downto 24); -- MSB contains typical "192"
udp_frame(31) <= dst_ip_address(23 downto 16);
udp_frame(32) <= dst_ip_address(15 downto 8);
udp_frame(33) <= dst_ip_address(7 downto 0);

The UDP-header is very simple: it only contains the UDP-ports, the length of the UDP-datagram (header plus payload) and the UDP-checksum:


-- UDP HEADER (8 bytes)
udp_frame(34) <= src_udp_port(15 downto 8);
udp_frame(35) <= src_udp_port(7 downto 0);
udp_frame(36) <= dst_udp_port(15 downto 8);
udp_frame(37) <= dst_udp_port(7 downto 0);
udp_frame(38) <= std_logic_vector(to_unsigned(UDP_HEADER_LENGTH + UDP_PAYLOAD_LENGTH, 16)(15 downto 8)); -- length (length of this UDP packet including header and data. Minimum 8 bytes)
udp_frame(39) <= std_logic_vector(to_unsigned(UDP_HEADER_LENGTH + UDP_PAYLOAD_LENGTH, 16)(7 downto 0));
udp_frame(40) <= x"00"; -- checksum (will be placed later)
udp_frame(41) <= x"00";

The UDP-payload itself can be as simple as the following "Hello World":


-- UDP PAYLOAD (18 bytes)
udp_frame(42) <= x"48"; -- H
udp_frame(43) <= x"45"; -- E
udp_frame(44) <= x"4c"; -- L
udp_frame(45) <= x"4c"; -- L
udp_frame(46) <= x"4f"; -- O
udp_frame(47) <= x"20"; --  
udp_frame(48) <= x"57"; -- W
udp_frame(49) <= x"4f"; -- O
udp_frame(50) <= x"52"; -- R
udp_frame(51) <= x"4c"; -- L
udp_frame(52) <= x"44"; -- D
udp_frame(53) <= x"21"; -- !
udp_frame(54) <= x"20"; --  
udp_frame(55) <= x"30"; -- 0
udp_frame(56) <= x"31"; -- 1
udp_frame(57) <= x"32"; -- 2
udp_frame(58) <= x"33"; -- 3
udp_frame(59) <= x"34"; -- 4

You can have a look into the full code in the file udp_packet.vhd of the FPGA_Ethernet-project on GitHub.
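
For comparing what the FPGA sends with a known-good reference, the same frame can be assembled on the PC. The following is a minimal sketch in Python, not the code from the repository: the function names `internet_checksum` and `build_udp_frame` are hypothetical, and the UDP checksum is simply left at 0x0000 (which IPv4 permits as "no checksum"), whereas the VHDL code fills it in later.

```python
import struct


def internet_checksum(data):
    """16-bit ones' complement of the ones' complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:  # fold carries back into the lower 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_udp_frame(dst_mac, src_mac, src_ip, dst_ip, src_port, dst_port,
                    payload, packet_counter=1):
    """Build MAC + IPv4 + UDP headers plus payload (no preamble, no FCS)."""
    mac_header = dst_mac + src_mac + b"\x08\x00"  # EtherType 0x0800 = IPv4
    total_length = 20 + 8 + len(payload)
    ip_header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0x00, total_length,         # version/IHL, DSCP/ECN, total length
        packet_counter & 0xFFFF, 0x0000,  # identification, flags/fragment offset
        0x80, 0x11, 0x0000,               # TTL 128, protocol UDP, checksum = 0
        src_ip, dst_ip)
    checksum = internet_checksum(ip_header)
    ip_header = ip_header[:10] + struct.pack("!H", checksum) + ip_header[12:]
    # Assumption of this sketch: UDP checksum stays 0x0000.
    udp_header = struct.pack("!HHHH", src_port, dst_port, 8 + len(payload), 0)
    return mac_header + ip_header + udp_header + payload
```

Called with six-byte MAC addresses, four-byte IP addresses and the payload `b"HELLO WORLD! 01234"`, this returns a 60-byte frame whose bytes can be compared one by one with a Wireshark capture of the FPGA output (apart from the UDP checksum field).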

Calculate the Checksums

The state-machine stays in the state s_CalcChecksum as long as the IP-header checksum or the UDP checksum is being calculated. For the IP-header this is done using the following code:


-- calculate checksum for IP-Header
if (checksum_byte_count < IP_HEADER_LENGTH) then
    Word                    := udp_frame(MAC_HEADER_LENGTH + checksum_byte_count) & udp_frame(MAC_HEADER_LENGTH + checksum_byte_count + 1);
    checksum_tmp            <= checksum_tmp + resize(unsigned(Word), 32);
    checksum_byte_count     <= checksum_byte_count + 2; -- we are reading two bytes at once
else
    -- checksum is calculated -> make sure that we have only 2-byte checksum and add carryover above 16th bit to 16-bit checksum
    if (checksum_tmp(31 downto 16) > 0) then
        checksum_tmp <= resize(checksum_tmp(15 downto 0), 32) + resize(checksum_tmp(31 downto 16), 32);
    else
        checksum                <= x"ffff" - checksum_tmp(15 downto 0);
        calculating_checksum    <= '0';
    end if;
end if;

So the calculation keeps running until the result fits into a 16-bit value between 0x0000 and 0xFFFF. The checksum is not a regular CRC16 or CRC32, but a running sum of the individual 16-bit words of the header, folded down to 16 bits and inverted at the end.
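
The same three steps (add the 16-bit words, fold the carry, invert) can be checked on the PC. This is a minimal Python model of the logic above, not generated from the VHDL; the function name `ip_header_checksum` is hypothetical, and the test header is a commonly used textbook example rather than a packet from this project.

```python
def ip_header_checksum(header):
    """Model of the checksum steps for a header of even length.

    The checksum field inside the header must be zero while summing.
    """
    checksum_tmp = 0
    # Add the header as big-endian 16-bit words, two bytes per step.
    for i in range(0, len(header), 2):
        checksum_tmp += (header[i] << 8) | header[i + 1]
    # Fold everything above bit 15 back into the lower 16 bits.
    while checksum_tmp >> 16:
        checksum_tmp = (checksum_tmp & 0xFFFF) + (checksum_tmp >> 16)
    # Invert: same as x"ffff" - checksum_tmp in the VHDL code.
    return 0xFFFF - checksum_tmp


header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(ip_header_checksum(header)))  # prints 0xb861
```

In this example the sum of the words is 0x2479C, folding gives 0x479E, and the inverted result is 0xB861.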

More protocols: ARP and ICMP

It took a while to optimize the timing of the logic so that valid Ethernet packets are sent even at high packet rates, but it was a nice feeling when my packets showed up in Wireshark on my desktop computer. Describing everything in detail would go beyond the scope of this blog post, but once UDP packets could be sent it was not so hard to implement ARP and ICMP as well.

The only thing I had to add was a signal router that keeps track of the incoming PING and ARP requests, so that they are answered in the right order with the correct packets. For this I had to implement some kind of RAM module. My eval board has some SDRAM, but for the most basic functions I sacrificed some of the FPGA logic as RAM. As ARP and ICMP packets are quite small, I implemented a RAM of only 100 bytes that fits into the FPGA without large logic demand.

Some of the receiver functions can be found in the file ethernet_packet_parser.vhd, and the RAM module is in the file eth_ram.vhd. During coding I realized why companies sell this kind of Ethernet function for more than 1000€... but I was close to a working framework.
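
To give an idea of how small such a packet is, here is a minimal Python sketch of an ARP reply for IPv4 over Ethernet. It only illustrates the standard packet layout and is not taken from the VHDL sources; the function name `build_arp_reply` and its parameters are hypothetical.

```python
import struct


def build_arp_reply(own_mac, own_ip, requester_mac, requester_ip):
    """Build an Ethernet frame carrying an ARP reply (no padding, no FCS)."""
    mac_header = requester_mac + own_mac + b"\x08\x06"  # EtherType 0x0806 = ARP
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,        # hardware type: Ethernet
        0x0800,   # protocol type: IPv4
        6, 4,     # hardware / protocol address length in bytes
        2,        # operation: 1 = request, 2 = reply
        own_mac, own_ip,              # sender addresses (the FPGA)
        requester_mac, requester_ip)  # target addresses (the asking PC)
    return mac_header + arp
```

The result is 42 bytes long (14 bytes MAC-header plus 28 bytes ARP), so it fits easily into a 100-byte RAM. On the wire the frame still has to be padded to the Ethernet minimum, which this sketch leaves out.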

Finally: Audio over IP

Sending "Hello World" messages using UDP at high packet rates was nice, but I wanted to hear some audio. So I collected 24-bit audio samples into a buffer with 32 bits per sample for better compatibility. The size of the buffer can be adjusted within the VHDL code, but is set to 16 samples for the first tests. With a maximum payload of 1460 bytes (= 365 samples of 4 bytes each), we could transmit 32 channels with a buffer of 11 samples. When sending only 3 bytes per sample, the payload holds 486 samples, which means we can transmit 48 channels with a buffer of 10 samples.
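
The arithmetic behind these numbers can be written down in a few lines of Python. This is only a sketch: the function name `samples_per_packet` is hypothetical, and the optional `header_bytes` parameter is an assumption added here for payloads that start with their own small header.

```python
def samples_per_packet(max_payload_bytes, channels, bytes_per_sample,
                       header_bytes=0):
    """Number of complete sample frames (one sample per channel) that fit."""
    usable = max_payload_bytes - header_bytes
    return usable // (channels * bytes_per_sample)


print(samples_per_packet(1460, 32, 4))  # prints 11
print(samples_per_packet(1460, 48, 3))  # prints 10
```

32 channels with 4 bytes each need 128 bytes per sample frame, so 11 frames (1408 bytes) fit; 48 channels with 3 bytes each need 144 bytes, so 10 frames (1440 bytes) fit.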

The audio-samples are collected using a memory-pointer that is incremented on each rising edge of the sync-signal:


-- copy audio-data to buffer-array when we are receiving new samples (audio_sync)
if ((audio_sync = '1') and (zaudio_sync = '0')) then
    -- rising edge of audio_sync -> read audio-data
    if (audio_buffer_ptr <= (AUDIO_BUFFER_LENGTH - (2 * AUDIO_CHANNELS * BYTES_PER_SAMPLE))) then
        -- not yet at the last element: advance the buffer-pointer by one sample-frame
        audio_buffer_ptr <= audio_buffer_ptr + (AUDIO_CHANNELS * BYTES_PER_SAMPLE); -- we are storing AUDIO_CHANNELS * BYTES_PER_SAMPLE bytes
    else
        -- buffer-pointer is at the last element and the next one would be out of range, so reset to first element
        frame_start <= '1'; -- set flag to read buffer when state-machine enters s_Idle again
        audio_buffer_ptr <= 0; -- reset to first element
    end if;
    
    sample_buffer(audio_buffer_ptr)     <= x"00"; -- LSB of audiosample
    sample_buffer(audio_buffer_ptr + 1) <= audio_data_l(7 downto 0);
    sample_buffer(audio_buffer_ptr + 2) <= audio_data_l(15 downto 8);
    sample_buffer(audio_buffer_ptr + 3) <= audio_data_l(23 downto 16); -- MSB of audiosample
    sample_buffer(audio_buffer_ptr + 4) <= x"00"; -- LSB of audiosample
    sample_buffer(audio_buffer_ptr + 5) <= audio_data_r(7 downto 0);
    sample_buffer(audio_buffer_ptr + 6) <= audio_data_r(15 downto 8);
    sample_buffer(audio_buffer_ptr + 7) <= audio_data_r(23 downto 16); -- MSB of audiosample
end if;

That's all. The data is sent as regular UDP packets to the receiving destination and can be seen either in Wireshark as incoming traffic or received with a Python script.
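
The byte order written into sample_buffer matters for the receiver: each 24-bit sample is stored as a zero byte followed by the low, middle and high byte. A minimal Python sketch of that layout (the function names `pack_stereo_frame` and `unpack_stereo_frame` are hypothetical) shows that reading the four bytes as a little-endian signed 32-bit integer yields the original sample shifted left by 8 bits:

```python
import struct


def pack_stereo_frame(sample_l, sample_r):
    """Pack two signed 24-bit samples like the sample_buffer assignments:
    one zero byte first, then the sample from low byte to high byte."""
    out = bytearray()
    for sample in (sample_l, sample_r):
        out += b"\x00" + (sample & 0xFFFFFF).to_bytes(3, "little")
    return bytes(out)


def unpack_stereo_frame(data):
    """Read 8 bytes back as two little-endian signed 32-bit integers."""
    return struct.unpack("<ii", data)


frame = pack_stereo_frame(0x123456, -1)
print(frame.hex())                 # prints 0056341200ffffff
print(unpack_stereo_frame(frame))  # prints (305419776, -256)
```

305419776 is 0x12345600, and -256 is -1 shifted left by 8 bits, so the sign of the 24-bit sample survives the 32-bit cast.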

Audio-Receiver

As I like Embarcadero Delphi more than Python, I've created a program that receives the incoming UDP-packets and puts the received audio-samples into a ring-buffer. There it validates the packet-counter and casts the received four bytes as a new 32-bit audio-sample:


procedure udpserverUDPRead(AThread: TIdUDPListenerThread; const AData: TIdBytes; ABinding: TIdSocketHandle);
begin
  // first check the packet length and the expected header = NDNG (short for "NOEDING")
  if (Length(AData) >= 8) and (AData[0] = $4e) and (AData[1] = $44) and (AData[2] = $4e) and (AData[3] = $47) then
  begin
    // we received a good UDP-Payload-Header
    udpPacketSize := length(AData);
    packetCounter := (AData[4] shl 8) + AData[5];

    // optional check if this packet is in the expected order
    
    // now get some data from the packet
    channelCount := AData[6] and $3f; // in this byte we transmit the number of channels
    case ((AData[6] and $c0) shr 6) of
      0: sampleRate := 44100;
      1: sampleRate := 48000;
      2: sampleRate := 96000;
      3: sampleRate := 192000;
    end;
    samplesPerPacket := AData[7];
    
    // now restore audio-samples
    for i:=0 to samplesPerPacket-1 do
    begin
      // first 8 bytes are for header
      for c:=0 to channelCount-1 do
      begin
        // copy 32 bit of audio-data into ring-buffer (samples are interleaved: channelCount * 4 bytes per sample-frame)
        ringbuffer[c][ringbufferWritePointer] := PInteger(@AData[8 + (i*channelCount + c)*4])^;
      end;
      
      ringbufferWritePointer := ringbufferWritePointer + 1;
      if (ringbufferWritePointer >= audioBufferSize) then
      begin
        // wrap the pointer around
        ringbufferWritePointer := 0;
      end;
    end;
  end;
end;

There is also an option to write the incoming data to a valid Wave-file, so that the audio-data can be used for some post-processing.
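
For readers who prefer Python over Delphi, the same payload parsing could look roughly like this. It is a sketch derived from the Delphi listing above, not a tested receiver for the project: the function name `parse_audio_packet` is hypothetical, and it assumes that the samples of all channels are interleaved per sample frame with 4 bytes per sample.

```python
import struct

SAMPLE_RATES = (44100, 48000, 96000, 192000)


def parse_audio_packet(data):
    """Parse one UDP payload; returns None if the packet is not usable."""
    if len(data) < 8 or data[0:4] != b"NDNG":
        return None
    packet_counter = (data[4] << 8) | data[5]
    channel_count = data[6] & 0x3F                     # lower 6 bits
    sample_rate = SAMPLE_RATES[(data[6] & 0xC0) >> 6]  # upper 2 bits
    samples_per_packet = data[7]
    count = samples_per_packet * channel_count
    if len(data) < 8 + 4 * count:
        return None  # truncated packet
    # Little-endian signed 32-bit values, starting after the 8-byte header.
    values = struct.unpack_from("<%di" % count, data, 8)
    # channels[c][i] is sample i of channel c
    channels = [list(values[c::channel_count]) for c in range(channel_count)]
    return packet_counter, sample_rate, channels
```

The payloads themselves can be obtained with a plain UDP socket from the standard library (`socket.socket(socket.AF_INET, socket.SOCK_DGRAM)`, then `bind` and `recvfrom`); the port number depends on the values configured in the FPGA.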

Outlook

This project is a nice basis for some more advanced functions, as it already recognizes the type of audio sent, the number of channels, etc., and the transmission happens in real time (or very close to it, with only 200 to 320 microseconds of latency depending on the number of buffered samples).

My goal is to implement a 48-channel audio sender to carry audio between my FPGA audio devices over regular Ethernet. I'm aware that there are professional solutions out there (Dante, etc.), but these are closed-source or hard to implement. My Windows app is already prepared for some more data: [Screenshot of the Windows app]

Chris.Dev.Blog