I have set out to build a SharePoint 2016 disaster recovery farm extending my home-based on-premises SharePoint 2016 farm.
My objectives
- Continue to build my networking, Windows Server and other infrastructure-related skills. I come from an application development background.
- Build my hands-on skills and knowledge with Azure IaaS:
- Azure Virtual networking, Site-to-site VPN
- Azure virtual machine management
- Gain in-depth architecture and system administration knowledge of all the pieces that make up a disaster recovery farm using the SQL Server AlwaysOn (asynchronous commit) approach.
- Understand performance and latency under asynchronous commit to a secondary database replica.
I used the following article as my primary source:
Plan for SQL Server AlwaysOn and Microsoft Azure for SharePoint Server 2013 Disaster Recovery
I tried my best to follow all the steps, but I approached them in a different order per my own DR design.
As a result, the following link contains my raw notes and screenshots of some of my detailed steps in building the disaster recovery farm.
https://onedrive.live.com/redir?resid=D50B33B813A3693B!13901&authkey=!ANaDqU9cBkj36s0&ithint=file%2cdocx
My naming conventions are not perfectly consistent since I was building on the go. With these notes, it is my hope you can come away with some steps to a working solution.
The following is a summary of key steps in building my disaster recovery lab in Azure.
On-premises Home Network and Azure Network
My personal home network consists of a set of Hyper-V virtual machines hosted on a Windows 10 desktop PC. The specifications are an Intel Core i5 (4 processors), 16 GB RAM, an Intel solid-state drive for the virtual machine disks, and a D-Link DIR-826L router.
My on-premises environment:
- homedc virtual machine
domain controller and DNS
domain: rkhome.com
It also serves as a general file server, because I don’t have enough RAM and CPU for a dedicated file and backup server. This is not the ideal server topology.
- homesp virtual machine
SharePoint 2016 single server farm and SQL 2014SP1 database. SP is installed.
A single-server farm is used instead of the desired 2-server topology because I don’t have enough CPU and RAM.
- homerras virtual machine
Routing and Remote Access Server (RRAS)
Used to establish site-to-site VPN connectivity with an Azure virtual network. There are other options such as using a hardware VPN router. This server is not domain joined.
- D-Link router
Port forwarding feature is leveraged to support site-to-site VPN connectivity.
Azure Disaster Recovery Site
The Microsoft cloud-based disaster recovery site.
- Virtual Network
Configured two subnets: one for the SharePoint farm and the other as the gateway subnet for the site-to-site VPN.
- rkdc virtual machine
domain controller and DNS (no domain controller promotion just yet)
Note: At a minimum, assign this server a static IP address rather than a dynamic one in the Azure portal.
- rksp virtual machine
SharePoint 2016 single server (not installed yet)
- rksql virtual machine
SQL 2014SP1 database server
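Site-to-site VPN routing generally requires that the on-premises and Azure address ranges do not overlap, and that each subnet sits inside the virtual network's address space. The following is a minimal sketch of that check using Python's standard `ipaddress` module. The address ranges and the subnet name `SharePointSubnet` are illustrative assumptions, not the values used in this lab.

```python
import ipaddress


def validate_address_plan(on_prem, vnet, subnets):
    """Return a list of problems found in a site-to-site address plan."""
    on_prem_net = ipaddress.ip_network(on_prem)
    vnet_net = ipaddress.ip_network(vnet)
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    problems = []

    # The two sites must use distinct ranges or traffic cannot be routed.
    if on_prem_net.overlaps(vnet_net):
        problems.append(f"{vnet} overlaps on-premises range {on_prem}")

    # Every subnet must be carved out of the virtual network address space.
    for name, net in nets.items():
        if not net.subnet_of(vnet_net):
            problems.append(f"{name} ({net}) is outside {vnet}")

    # Subnets must not overlap each other.
    names = list(nets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if nets[first].overlaps(nets[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


# Hypothetical plan: all ranges below are assumed for illustration only.
plan = {"SharePointSubnet": "10.0.0.0/24", "GatewaySubnet": "10.0.1.0/29"}
print(validate_address_plan("192.168.0.0/24", "10.0.0.0/16", plan))
```

An empty list means the plan passed all three checks.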
Site-to-site VPN and DC Replica
Enable cross network connectivity between the on-premises home network and the Azure virtual network. The other option is using ExpressRoute, which is more suited for production scenarios for its private connection, higher bandwidth, better performance and reliability.
Port forwarding is configured in the D-Link home router to allow internet connectivity to the homerras server for a VPN connection.
Virtual Network Gateway
Serves as the cross-premises gateway connecting your workloads in the Azure Virtual Network to on-premises sites. This gateway has a public IP address accessible from the internet.
Local Network Gateway
Enables interaction with on-premises VPN devices represented in the Gateway Manager. It therefore needs to be configured with the home router’s public WAN IP address. The port forwarding setup always directs traffic to the RRAS server as the VPN device.
Connection
Represents a connection between two gateways – the virtual network gateway and the local network gateway.
homerras RRAS Server
Configured an interface named “Remote Router” with the public IP address 40.114.x.x of the virtual network gateway.
Domain Controller replica on the Azure virtual network
Prerequisite: site-to-site VPN connection needs to be active.
Install a replica Active Directory domain controller for the rkhome.com domain in the Azure virtual network
Domain-join the rksp and rksql servers to rkhome.com
Any added DNS records and AD accounts will be synchronized between the two domain controllers.
In testing the VPN connection, any machine connected to the on-premises network was able to ping or RDP, with a domain account, into any other server in the Azure virtual network and vice versa.
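The RDP part of that test can be scripted. The sketch below only checks that a TCP connection to the RDP port (3389) can be opened from one side of the VPN to the other; it does not authenticate or replace a ping test. The function names are my own and hypothetical.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_hosts(hosts, port=3389):
    """Map each host name to whether its RDP port is reachable."""
    return {host: can_connect(host, port) for host in hosts}
```

For example, running `check_hosts(["rkdc", "rksp", "rksql"])` from an on-premises machine, and `check_hosts(["homedc", "homesp"])` from an Azure VM, would cover both directions of the tunnel.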
SharePoint 2016, WSFC, and SQL Server AlwaysOn
SharePoint 2016
Installed on the Azure rksp virtual machine as a single-server farm with a My Sites host and a portal site collection. SharePoint 2016 was already installed on the on-premises farm before the start of this lab.
Windows Server Failover Cluster
Installed the Windows Server Failover Cluster feature on homesp and rksql, as they hold the database server role.
Name: SPSQLCluster
IP Address: 192.168.0.102
The file share cluster quorum witness is hosted on homedc. This quorum witness should be on a dedicated file server, but I do not have enough memory resources for another VM.
Set Node weight = 1 on primary homesp node
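To see why node weights matter, here is a simplified vote-counting sketch: the cluster keeps quorum only while more than half of the configured votes are online. It ignores WSFC features such as dynamic quorum, and it assumes the Azure node rksql carries a weight of 0 during normal operation, which is a common DR arrangement but is my assumption here. The witness name `homedc-witness` is made up for illustration.

```python
def quorum_threshold(total_votes):
    """Smallest number of votes that is a strict majority."""
    return total_votes // 2 + 1


def has_quorum(voters, online):
    """voters maps member name -> vote weight; online is a set of member names."""
    total = sum(voters.values())
    present = sum(weight for name, weight in voters.items() if name in online)
    return present >= quorum_threshold(total)


# Normal operation (assumed weights): on-premises node and witness hold the votes.
normal = {"homesp": 1, "rksql": 0, "homedc-witness": 1}
print(has_quorum(normal, {"homesp", "homedc-witness"}))  # on-premises site up
print(has_quorum(normal, {"rksql"}))                     # only the Azure node up

# After moving the vote to the DR node, rksql alone can hold quorum.
dr = {"homesp": 0, "rksql": 1, "homedc-witness": 0}
print(has_quorum(dr, {"rksql"}))
```

With the assumed normal weights, losing the on-premises site leaves the Azure node without quorum, which is why the vote assignment has to be adjusted as part of a failover to DR.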
SQL Server AlwaysOn
Enabled SQL AlwaysOn and asynchronous commit configuration. This is recommended for higher network latency due to the VPN connection and geographic distance between the two sites. Synchronous commit is recommended for network latency of <1ms for SharePoint. When I ping servers across the two environments (Toronto and North Central US), I get an average of about 75ms ranging from 30ms to 110ms.
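The decision above can be written down as a small helper. This is only a sketch of the rule of thumb quoted in this post (synchronous commit below 1 ms, asynchronous otherwise); the sample round-trip times are made-up values within the 30ms to 110ms range I observed, not recorded measurements.

```python
from statistics import mean

SYNC_COMMIT_MAX_LATENCY_MS = 1.0  # threshold quoted above for SharePoint


def summarize_latency(samples_ms):
    """Return min, average and max of a list of round-trip times in ms."""
    return {"min": min(samples_ms), "avg": mean(samples_ms), "max": max(samples_ms)}


def recommend_commit_mode(samples_ms):
    """Pick a commit mode from the average round-trip time."""
    if mean(samples_ms) < SYNC_COMMIT_MAX_LATENCY_MS:
        return "synchronous"
    return "asynchronous"


# Hypothetical ping samples between the two sites, in milliseconds.
samples = [30, 62, 75, 88, 110, 85]
print(summarize_latency(samples))
print(recommend_commit_mode(samples))
```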
The databases supported for asynchronous commit are listed in the article Supported high availability and disaster recovery options for SharePoint databases (SharePoint 2013)
https://technet.microsoft.com/en-us/library/jj841106.aspx
The databases below were deleted on the rksql secondary before replication from the homesp primary database instance.
Availability groups
- AG_SPContent
- MySites
- PortalContent
- AG_SPServicesAppsDB
- App Management
- Managed Metadata
- Subscription Settings
- User Profile
- User Social
- Secure Store
Configuration databases are farm specific. Search databases can be updated with a full crawl upon failover.
Availability Listener configuration for each availability group
- agl_spcontent1 for AG_SPContent
0.0.8 (on-premises)
192.168.0.103 (Azure DR)
- agl_spservice for AG_SPServicesAppsDB
0.0.9 (on-premises)
192.168.0.107 (Azure DR)
Evaluating AlwaysOn Availability Group in Asynchronous Commit Mode
Failover Test
- Manually shut down the SharePoint IIS web sites
This simulates a failure event such as an IIS shutdown
- For each availability group, fail over to the secondary replica
Resume data movement
- Adjust WSFC node voting rights
- Update DNS records of SharePoint sites to DR
Start IIS on the original primary on-premises site
This can be repeated to fail over to the on-premises site, making it the primary once again.
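The role swap at the heart of this test can be pictured with a toy model. The sketch below is not how SQL Server is driven (that is done in SQL Server Management Studio or T-SQL); it only tracks which server is primary for each availability group and models data movement as suspended after a failover, which is what the "resume" step above implies. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AvailabilityGroup:
    name: str
    primary: str
    secondary: str
    data_movement_suspended: bool = False


def fail_over(ag):
    """Swap replica roles; model data movement as suspended afterwards."""
    ag.primary, ag.secondary = ag.secondary, ag.primary
    ag.data_movement_suspended = True


def resume_data_movement(ag):
    """Mark synchronization to the new secondary as running again."""
    ag.data_movement_suspended = False


def run_failover(groups):
    """Fail over every availability group and return a log of the steps."""
    log = []
    for ag in groups:
        fail_over(ag)
        resume_data_movement(ag)
        log.append(f"{ag.name}: primary is now {ag.primary}")
    return log


groups = [
    AvailabilityGroup("AG_SPContent", primary="homesp", secondary="rksql"),
    AvailabilityGroup("AG_SPServicesAppsDB", primary="homesp", secondary="rksql"),
]
for line in run_failover(groups):
    print(line)
```

Calling `run_failover(groups)` a second time swaps the roles back, mirroring the failback to the on-premises site.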
Comments on Azure costs
Virtual Machines
- Domain controller and DNS – Basic A1 1 cpu 1.75GB RAM
- Left running
- SQL Server database server – Basic A1 2cpu 3.5GB RAM
- Left running
- SharePoint 2016 single-server – Basic A4 4CPU 7GB RAM
- Turned off in cold standby
- VPN Gateway
- ~$31CAD/month
- Pricing is based on time; however, I didn’t find a way to stop or pause usage to save on costs.
I estimate the cost of running the above resources at about $130 CAD/month, provided the SharePoint VMs are stopped as part of a cold standby approach.
Final Remarks
This has been a great learning experience, as I now understand how all the little pieces work together. Out in the enterprise world, disaster recovery tends to be low on the project roadmap, or absent from it altogether. However, as the business criticality of a technology solution increases, so does the need for a DR solution. Hosting in Azure is a cost-effective option since you pay only for what you use, especially in cold standby scenarios. Leveraging Azure regions in geographically remote areas is appropriate for mitigating widespread disasters such as hurricanes, mass power outages, earthquakes, floods or even outbreaks that can affect a data centre’s operability.
In technology, there are some things you do not really know until you build them with your own hands. Learning is by doing.