DRA Overview

The Dynamic Routing Agent (DRA) provides load balancing across multiple PCRFs and session binding for Diameter, ensuring that a specific PCRF handles the full control dialogue for a given session. It also handles failover between PCRFs in case of network failures.

The DRA is a functional element that ensures that all Diameter sessions established over the Gx, S9, Gxx and Rx reference points for a certain IP-CAN session reach the same PCRF when multiple and separately addressable PCRFs have been deployed in a Diameter realm.

The DRA layer acts as a “front end” for the PCRF, hiding the multiple internal PCRF servers which provide the main logic of the PCRF. The DRA layer provides load sharing and high availability capabilities for the PCRF’s Diameter interfaces.

Figure: DRA network context

DRA-PCRF High Availability

The DRA layer is highly available and runs as two active/active processes on different blades (or nodes) in a cluster. A TCP connection between the two DRAs replicates PCRF binding information between them; a restarting DRA must first obtain a complete copy of this information (known as a “bulk transfer”) from the other DRA. Multiple PCRF server processes run on each blade, and the DRA layer hides these third-party PCRF servers from Diameter clients. Although external Diameter clients are configured to connect to both the primary DRA and the other DRA, it is assumed that there is no real control over which connection is used at any given time.

In general, clients switch to the secondary DRA if the primary link fails, and back to the primary once it recovers. PCRF/subscriber binding information elements are synchronised between the primary and secondary DRAs, and both DRAs have Diameter connections with each PCRF.

Using the PCRF server status from both DRA processes, each DRA implements an algorithm to decide which PCRF server should handle each new request. Every time a PCRF binding is allocated by a DRA, the information is sent immediately to the other DRA. If PCRF binding information is available for a subsequent request, this information is used to route the request or answer to the correct Diameter peer. When the PCRF binding information is no longer needed, the DRA removes the information and informs the other DRA.
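The binding life cycle described above can be sketched as follows. This is a minimal illustration only, not the product's implementation: the `Dra` class, its method names, the round-robin selection policy and the direct write into the peer's table (standing in for the TCP replication link) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Dra:
    """Toy model of one DRA process (hypothetical names throughout)."""
    name: str
    pcrfs: List[str]  # PCRF servers this DRA currently considers available
    bindings: Dict[str, str] = field(default_factory=dict)  # session key -> PCRF
    peer: Optional["Dra"] = field(default=None, repr=False, compare=False)
    _next: int = 0  # round-robin cursor; the real selection algorithm is not specified

    def route(self, session_key: str) -> str:
        """Return the PCRF for a request, allocating a binding if none exists."""
        pcrf = self.bindings.get(session_key)
        if pcrf is None:
            if not self.pcrfs:
                raise RuntimeError("no PCRF server available")
            pcrf = self.pcrfs[self._next % len(self.pcrfs)]
            self._next += 1
            self.bindings[session_key] = pcrf
            if self.peer is not None:
                # Stand-in for sending the new binding to the other DRA immediately.
                self.peer.bindings[session_key] = pcrf
        return pcrf

    def release(self, session_key: str) -> None:
        """Remove a binding that is no longer needed and inform the other DRA."""
        self.bindings.pop(session_key, None)
        if self.peer is not None:
            self.peer.bindings.pop(session_key, None)
```

With two such objects set as each other's `peer`, a session first routed by one DRA resolves to the same PCRF on the other, which is the property the binding replication is meant to guarantee.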

DRA Information Synchronisation

A restarting DRA must first obtain a complete copy of the session state information from the other DRA. If a large amount of information has to be transferred, the bulk transfer can take a considerable period of time.

Each DRA is configured with a list of servers within the PCRF. The DRA has basic information about each Diameter server, such as node IP address and port. For each configured server, the DRA stores information about the server status as part of an “available server list”. Each DRA passes its own view of the server availability to the other DRA.
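As an illustration of the “available server list”, the sketch below combines the local view of server status with the view received from the other DRA. The `PcrfServer` type, the `usable_servers` function and the merge rule (a server is used only if both DRAs report it as up) are assumptions for the example; the document does not specify how the two views are combined.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PcrfServer:
    """Basic configured information about one Diameter server (hypothetical type)."""
    name: str
    ip: str
    port: int


def usable_servers(
    configured: List[PcrfServer],
    local_view: Dict[str, bool],
    peer_view: Dict[str, bool],
) -> List[PcrfServer]:
    """Return the configured servers that both DRAs currently report as up.

    Assumed rule: a server missing from either view is treated as down.
    """
    return [
        server
        for server in configured
        if local_view.get(server.name, False) and peer_view.get(server.name, False)
    ]
```

A more permissive rule, for example trusting the local view alone, would be an equally valid reading of the text; the conservative rule is used here only to keep the example simple.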

After the bulk transfer is complete, the DRAs will share status information with each other.
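The restart sequence can be pictured as a chunked copy of the peer's binding table. The sketch below is a simplified model under stated assumptions: the function names and the chunk size are hypothetical, the transfer is an in-memory copy rather than a TCP exchange, and updates arriving during the transfer are not modelled.

```python
from typing import Dict, Iterator, List, Tuple


def bulk_chunks(
    bindings: Dict[str, str], chunk_size: int
) -> Iterator[List[Tuple[str, str]]]:
    """Yield the peer's bindings in fixed-size chunks (session key, PCRF)."""
    items = sorted(bindings.items())
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def restart_sync(peer_bindings: Dict[str, str], chunk_size: int = 2) -> Dict[str, str]:
    """Rebuild a restarting DRA's binding table from the other DRA's copy."""
    local: Dict[str, str] = {}
    for chunk in bulk_chunks(peer_bindings, chunk_size):
        local.update(chunk)  # apply one chunk of the bulk transfer
    return local
```

Splitting the copy into chunks reflects the point that a large table takes time to transfer; only once `restart_sync` has returned would the DRA begin the normal exchange of status information.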

DRA Switching

In general, clients switch to the secondary DRA if the primary link fails and back to primary once it recovers. PCRF/subscriber binding information elements are synchronised between primary and secondary DRAs, and both have Diameter connections with each PCRF.

The main failure case is the failure of a PCRF node. In this scenario, the DRA starts using a link to another (secondary) PCRF. When such a failure occurs, the following apply:

The DRA removes the failed node from its routing list.
Mid-session requests (REQs) for the failed node are rejected by the DRA.
If the primary link recovers, then requests start arriving on the primary link again; responses to these requests are sent back on the primary link.
Responses to requests already received on the secondary link are sent on the secondary link.
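The first two points above (removing the failed node and rejecting mid-session requests bound to it) can be sketched as follows. The `Router` class and its methods are hypothetical, and the use of the DIAMETER_UNABLE_TO_DELIVER result code (3002, defined in RFC 6733) for the rejection is an assumption; the document does not state which result code the DRA returns.

```python
from typing import Dict, List, Tuple, Union

# RFC 6733 result code; its use for this rejection is an assumption.
DIAMETER_UNABLE_TO_DELIVER = 3002

Decision = Tuple[str, Union[str, int]]


class Router:
    """Toy routing table for one DRA (hypothetical names)."""

    def __init__(self, pcrfs: List[str]) -> None:
        self.routing_list: List[str] = list(pcrfs)
        self.bindings: Dict[str, str] = {}  # session key -> PCRF

    def on_pcrf_failure(self, pcrf: str) -> None:
        """Remove a failed PCRF node from the routing list."""
        if pcrf in self.routing_list:
            self.routing_list.remove(pcrf)

    def handle_request(self, session_key: str) -> Decision:
        """Forward the request, or reject it if its bound PCRF has failed."""
        bound = self.bindings.get(session_key)
        if bound is not None:
            if bound not in self.routing_list:
                return ("reject", DIAMETER_UNABLE_TO_DELIVER)
            return ("forward", bound)
        if not self.routing_list:
            return ("reject", DIAMETER_UNABLE_TO_DELIVER)
        chosen = self.routing_list[0]  # simplest possible choice for a new session
        self.bindings[session_key] = chosen
        return ("forward", chosen)
```

In this model, new sessions are still served by the remaining PCRFs after a failure, while sessions already bound to the failed node are rejected rather than silently moved.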

Another failure case that can occur in the system is where a DRA node fails. In this scenario, the clients start using their link to a secondary DRA. There are a few points to note:

The secondary DRA determines the correct PCRF from the synchronised routing data and routes the REQs accordingly.
The PCRF now sends REQs to the secondary DRA (the PCRF needs to detect the new source of incoming messages).
If the primary link recovers, then requests start arriving on the primary link again; the PCRF should also revert at this point to sending outgoing messages to the primary DRA.
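The switching behaviour can be sketched from the point of view of any sender that holds a primary and a secondary link (a Diameter client or a PCRF). The `LinkSelector` class and its method names are hypothetical; the sketch only models link choice and the rule that a response travels on the link its request arrived on.

```python
from typing import Dict


class LinkSelector:
    """Toy model of primary/secondary link selection (hypothetical names)."""

    def __init__(self) -> None:
        self.up: Dict[str, bool] = {"primary": True, "secondary": True}
        self.pending: Dict[int, str] = {}  # request id -> link the request used

    def set_link_state(self, link: str, is_up: bool) -> None:
        self.up[link] = is_up

    def active_link(self) -> str:
        """Prefer the primary link; fall back to the secondary if it is down."""
        if self.up["primary"]:
            return "primary"
        if self.up["secondary"]:
            return "secondary"
        raise RuntimeError("no DRA link available")

    def send_request(self, request_id: int) -> str:
        """Send on the active link and remember which link was used."""
        link = self.active_link()
        self.pending[request_id] = link
        return link

    def response_link(self, request_id: int) -> str:
        """A response is returned on the same link as its request."""
        return self.pending.pop(request_id)
```

A request sent while the primary link is down is answered on the secondary link even if the primary has recovered in the meantime, while new requests revert to the primary as soon as it is back.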