​SQL 2017 Always-on AG Pt. 7: File Share Witness

Now that we have the Windows failover cluster service installed and configured with a management point, we need to configure a witness. A witness is a 'third party' that holds a tie-breaking quorum vote, which lets the cluster determine which nodes are still healthy and assists with failing over SQL services. The witness can live either in the cloud (Azure) or on a generic file share that resides on a NAS appliance or a Windows server. The file share witness must NOT reside on either SQL node, as that would defeat the purpose of having a witness. In my lab I deployed a bare-bones Windows Server VM to host the FSW.

​Create a File Share

​1. On a Windows member server (not either SQL server) open Server Manager, go to File and Storage Services, click on Shares, then from the Tasks menu select New Share. If you don’t have that option, add the File Server Role and wait for the installation to complete. No reboot is needed.

2. Select the SMB Share – Quick file share profile.

3. Select the appropriate volume the share will be created on.

4. Enter the new share name. ​I suggest this format: <Cluster name>-FSW (e.g. SQL2017CLA-FSW). ​Make note of the remote share path, as we will need this in a couple of minutes.

5. Enter a description in the format of: <Cluster Name> Cluster File Share Witness.

6. Uncheck Allow caching of share and check Encrypt data access.

7. Customize the permissions: disable inheritance and remove all inherited permissions.

8. Give the cluster computer object (e.g. SQL2017CLA) full control. If you want, you could also give administrators access so they can peek inside. Make sure to enable the search for 'computer' objects when you 'Select a principal' or it won't find your computer account.

​9. Finish the wizard and wait for the share to be created. If you get an access denied message, re-run the wizard with the same settings and see if a second attempt will work.​
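If you prefer PowerShell over the wizard, the share steps above can be roughly sketched as follows. This is a minimal sketch and not the method used in this post: the D:\Shares folder and the CONTOSO domain are placeholder assumptions, and it should be run from an elevated prompt on the witness server.

```powershell
# Placeholder values (assumptions): adjust for your environment.
$ClusterName = 'SQL2017CLA'
$ShareName   = "${ClusterName}-FSW"
$Path        = "D:\Shares\$ShareName"            # hypothetical folder on the witness server
$ClusterCno  = 'CONTOSO\' + $ClusterName + '$'   # cluster computer object in a placeholder domain

# Create the folder that backs the share.
New-Item -Path $Path -ItemType Directory -Force | Out-Null

# Share permissions: full control for the cluster computer object and administrators,
# offline caching disabled, SMB encryption enabled.
New-SmbShare -Name $ShareName -Path $Path `
    -Description "$ClusterName Cluster File Share Witness" `
    -FullAccess $ClusterCno, 'BUILTIN\Administrators' `
    -CachingMode None -EncryptData $true

# NTFS permissions: remove inherited entries, then grant full control explicitly.
icacls $Path /inheritance:r /grant "${ClusterCno}:(OI)(CI)F" 'BUILTIN\Administrators:(OI)(CI)F'

# Print the UNC path that the quorum wizard will ask for.
"\\$env:COMPUTERNAME\$ShareName"
```

The last line simply echoes the remote share path so you can copy it for the cluster configuration.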

​Fileshare Witness Cluster Configuration

1. On either SQL server launch the Windows Failover Cluster Manager.

2. Right click on the root cluster object (e.g. SQL2017CLA), select More Actions and then click Configure Cluster Quorum Settings.

3. Choose the "Select the quorum witness" option.

4. Select Configure a file share witness.

5. Enter the file share path you made note of from above. Click through the remainder of the wizard and verify the FSW was successfully configured.

6. Verify Quorum configuration is now using a file share witness. Note that you only need to do these steps once per cluster.
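The same quorum change can be sketched in PowerShell using the FailoverClusters module that ships with the clustering feature. FSW01 is a hypothetical name for the server hosting the share; substitute the share path you noted earlier.

```powershell
# Assumption: FSW01 is a placeholder for the server that hosts the witness share.
$WitnessPath = '\\FSW01\SQL2017CLA-FSW'

# Point the cluster quorum at the file share witness (run on either SQL node).
Set-ClusterQuorum -Cluster 'SQL2017CLA' -FileShareWitness $WitnessPath

# Confirm the result; QuorumResource should now reference the file share witness.
Get-ClusterQuorum -Cluster 'SQL2017CLA' | Format-List Cluster, QuorumResource, QuorumType
```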

​Summary

In this post we configured the Windows Cluster service with a file share witness. The FSW is needed to properly manage node failover. The FSW can be co-located with other services on another server, be a share from a NAS appliance, or use the cloud (Azure). It cannot be created on either SQL node.

Now that the Windows cluster services are fully configured, we will return to configuring SQL. The next installment will configure the pre-reqs for setting up an AAG, and then configure one AAG. You can find Part 8 here (coming).

SQL 2017 Installation Series Index

SQL 2017 Always-on AG Pt. 1: Introduction

SQL 2017 Always-on AG Pt. 2: VM deployment

​SQL 2017 Always-on AG Pt. 3: Service Accounts

SQL 2017 Always-on AG Pt. 4: Node A SQL Install

SQL 2017 Always-on AG Pt. 5: Node B SQL Install

SQL 2017 Always-on AG Pt. 6: Cluster Configuration

SQL 2017 Always-on AG Pt. 7: File Share Witness

​​​​​SQL 2017 Always-on AG Pt. 8: ​AAG Setup (Coming)

​​​​SQL 2017 Always-on AG Pt. 9: Kerberos (Coming)

​SQL 2017 Always-on AG Pt. 10: SSL Certificates (​Coming)

​SQL 2017 Always-on AG Pt. 11: Max Mem & Email Alerts (Coming)

​SQL 2017 Always-on AG Pt. 12: Maintenance Jobs (Coming)

SQL 2017 Always-On AG Pt. 6: Cluster Configuration

Now that SQL 2017 is installed on both nodes, we need to configure the Windows Cluster service. Although SQL AAGs don't use the 'traditional' clustering technology of shared SCSI disks and a quorum disk, the clustering service is required to manage the state of the AAG instances and fail them over.

For this procedure we will need one new IP address and a cluster DNS name. Get those ready before proceeding. The cluster name is only used for management purposes, and is NOT related to the SQL listener name. If DHCP is active on the server subnet the wizard will automatically pull an IP.

Microsoft has a technology called "Cluster Aware Updating" (CAU) which orchestrates the installing of Windows patches on hosts that are a part of a cluster. However, I think its utility is mostly aimed at Hyper-V hosts/clusters and less so at enterprise applications such as SQL. So I won't cover configuring CAU in this series.

​Cluster Role Installation

1. On the first SQL server launch Server Manager and select Add Roles and Features.

2. Select Role-Based or Feature-Based Installation.

3. Select the local server when prompted.

4. Skip the Server Roles page and on the Features page check Failover Clustering. When prompted, add the required features.

5. Continue through the rest of the wizard and wait for the installation process to complete.

6. Install Windows Failover Clustering on the second SQL node.
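For those who would rather script the feature installation, a minimal sketch is below. SQL2017A and SQL2017B are hypothetical hostnames for the two SQL nodes; they are not names used elsewhere in this series.

```powershell
# Assumption: placeholder hostnames for the two SQL nodes.
$Nodes = 'SQL2017A', 'SQL2017B'

foreach ($Node in $Nodes) {
    # Installs the Failover Clustering feature plus the management tools
    # (Failover Cluster Manager and the FailoverClusters PowerShell module).
    Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools -ComputerName $Node
}
```

Check the returned Success and RestartNeeded columns for each node before moving on to validation.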

​Cluster Validation

1. On Node A launch the Failover Cluster Manager, and in the left pane right click on Failover Cluster Manager and select Validate Configuration.

2. Enter the two hostnames of the SQL servers you are configuring.  Run all of the tests.

3. Review the cluster report for any errors. A warning regarding non-redundant NICs is normal, and can be ignored. If there are any other warnings/errors (such as a pending reboot due to patching), take the required action to remediate.

4. On the Validation summary screen check the box next to "Create a cluster now using the validated hosts..." and click Finish.
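Validation can also be run from PowerShell, which is handy if you want to re-run it later. A minimal sketch, again assuming the hypothetical node names SQL2017A and SQL2017B:

```powershell
# Runs the full set of validation tests against both nodes.
# The cmdlet returns the validation report file, which you can open in a browser.
Test-Cluster -Node 'SQL2017A', 'SQL2017B'
```

Unlike the wizard, this does not offer to create the cluster afterwards; that is a separate step.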

​Cluster Configuration

1. ​Enter the Cluster name (e.g. SQL2017CLA). 

​2. In my case DHCP was active on the server subnet, so the wizard automatically pulled an IP from the pool and assigned it to the cluster hostname. If DHCP is unavailable the wizard will prompt you for an IP. 

3. Validate that all of the cluster information is correct and UN-check the box next to the "Add all eligible storage to the cluster" option and click Next. Wait for the cluster to be built.

Note: If you forget to UN-check the storage box, the cluster service may claim all of your SQL disks and they could disappear from Explorer. If they have disappeared, go into the cluster manager and remove all of the drives as cluster resources. Then open Computer Management and bring each of the drives back online. Reboot the SQL server if you had to do this procedure.

​4. Review the summary screen to make sure all is well. You can safely ignore any warnings regarding a disk witness. We will configure a File Share Witness (FSW) in the next post.

5. Do a forward and reverse DNS lookup of the new cluster name to verify A and PTR records were created. Correct any issues.
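The cluster creation and DNS checks above can be sketched in PowerShell as follows. The node names and the 10.0.0.50 address are placeholder assumptions; the -NoStorage switch is the scripted equivalent of un-checking the "Add all eligible storage to the cluster" box.

```powershell
# Create the cluster without claiming any disks (DHCP supplies the cluster IP).
New-Cluster -Name 'SQL2017CLA' -Node 'SQL2017A', 'SQL2017B' -NoStorage

# If DHCP is not available on the subnet, supply a static address instead, e.g.:
# New-Cluster -Name 'SQL2017CLA' -Node 'SQL2017A', 'SQL2017B' -NoStorage -StaticAddress '10.0.0.50'

# Forward lookup: should return the A record for the cluster name.
Resolve-DnsName -Name 'SQL2017CLA' -Type A

# Reverse lookup: use the IP the cluster actually received (placeholder shown here).
Resolve-DnsName -Name '10.0.0.50' -Type PTR
```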

​Summary

In this installment we configured the Windows failover cluster service, and created a new cluster. This cluster name is used only for management purposes, and not used for the SQL listener. Next up in Part 7 we will configure the File Share Witness (FSW).


SQL 2017 Always-On AG Pt. 5: Node B SQL Installation

This is the fifth installment of the SQL 2017 AAG series, where we do an unattended install of SQL on node B. We will use a modified configuration INI file that was generated from the node A installation. This reduces human error and makes your installs repeatable. Using gMSAs makes the unattended install a bit easier, since we don't have to supply passwords for the service accounts. If you are using "normal" AD service accounts, two command line switches are needed to supply the proper credentials.

​Unattended SQL 2017 Installation

1. Create a folder on the second SQL server and copy the ConfigurationFile.INI from node A to this folder on node B. I used C:\SQL.

2. Open your favorite editor and change the specified parameters to the new values in the images below​​​. The QUIET and UIMODE parameters should be commented out, as shown.

​3. Mount the SQL 2017 Enterprise Edition ISO to the second SQL server. Open an elevated command prompt and change directory to the root of the SQL ISO. Enter the following command:

setup /configurationfile=c:\SQL\ConfigurationFile.ini /iacceptsqlserverlicenseterms

​Note: If you are not using gMSAs, you will need two switches to supply the credentials for the service accounts:

/SQLSVCPASSWORD="YourPassword"
/AGTSVCPASSWORD="YourPassword"

​4. Wait a few minutes for the installation to complete. After the install has completed, you can list the services running on the server and validate that SQL is using the proper gMSA service accounts.

​5. We aren't quite done yet, as you have to now install the SQL management tools just as we did on node A. Launch the SQL installer from the ISO, click on the SQL management tools link and download/install as you did before.

6. After the SQL tools have been installed, launch the SQL Server Management Studio and ensure you can login to the SQL server. If you have login issues, reboot the VM and try again.
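To check the service accounts mentioned in step 4 without clicking through the Services console, something like the following should work. It is a sketch that matches on the usual SQL service name prefixes, so it may also list related services such as the full-text launcher.

```powershell
# List SQL-related services with the account each one runs as.
# A gMSA shows up in StartName as DOMAIN\AccountName$ (note the trailing $).
Get-CimInstance -ClassName Win32_Service |
    Where-Object { $_.Name -match '^(MSSQL|SQLSERVERAGENT|SQLAgent)' } |
    Select-Object Name, StartName, State
```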

​Summary

Using the SQL 2017 unattended installation is very straightforward. Only a few parameters needed to be changed in the INI that the installer creates. Next up in Part 6 we will configure Windows clustering.

SQL 2017 Always-On AG Pt. 4: Node A SQL Installation

​This is Part 4 of the SQL 2017 Always-On Availability Group installation series. In this post we will manually install SQL 2017 on the first node of the cluster. We will then capture an answer file for SQL, which will be used ​for an unattended SQL installation on Node B (Part 5).

​SQL 2017 Node A Installation

​1. Mount the SQL 2017 Enterprise Edition ISO to the first VM and launch the installer.

2. On the left side click on Installation, then click on the top option "New SQL server stand-alone.."

3. On the Product Key screen accept/enter your key.

4. On the License Terms screen check the box to accept the terms.

5. On the ​Microsoft Update​ screen configure as you see fit.

6. On the ​Install Rules​ screen verify everything is green, with the possible exception of the Windows firewall which may be yellow (normal).

7. ​​​​​​On the ​Feature Selection​ screen select at least the Database Engine Services​​​​ ​option. If you need additional features, select them. At the bottom of the window change the SQL binary installation paths as needed.​​​

​8. On the ​Instance Configuration​ screen configure your instance as needed. For this series I'm leaving the default settings.

9. On the ​Server Configuration​ set the agent and database engine accounts to their respective gMSA. ​​​​​​No password is needed since that is handled by Windows. There's no need to check the box for granting Volume Maintenance privileges as we manually did that in a previous post by configuring user rights.

​10. On the ​Database Engine Configuration​ screen select the authentication mode and add the AD SQL Admin group you are using to the list of administrators.​​​ Windows authentication mode is the most secure, so only select mixed mode if you know an application requires it.

​11. Configure the SQL database engine directories as desired. I'm using the drives covered in previous posts.

12. TempDB configuration will vary wildly between instances. Some applications make very heavy use of TempDB, while others barely use it. SQL 2017 also bases the initial number of TempDB files on the quantity of CPUs assigned to the VM. The key here is to make sure you have an adequate number of TempDB files, and if you are using Nutanix storage, put them on separate disks. Make sure you set all the proper paths including the Log directory.

​13. On the ​Ready to Install ​screen copy the configuration file path to the clipboard, as we will need it later.​​​ ​Once the installation starts, browse to the INI file and copy it to a safe location so we can use it in the next post.

​14. After the installation completes for the database engine, we now need to install the SQL Server Management Tools. Launch the SQL installer again but this time select ​Install SQL Server Management Tools​.​​​ This will actually redirect you to a live web page where you have to download the latest 800MB+ tool package. Install the package. 

15. From the Start menu launch the SQL Server Management Studio and validate that you can successfully login. If your login does not work reboot the SQL VM and try again.

​Summary

​In this post we installed the first instance of SQL 2017 using the manual method. We were careful to select the proper paths for all the files, and copied the SQL configuration INI file to a safe place. In Part 5, we will use this INI for an unattended installation on the second node. This reduces human error and ensures both nodes are configured the same. Unlike SQL 2014 and earlier, the management tools are a separate download and install process. 

SQL 2017 Always-On AG Pt. 3: Service Accounts

This is Part 3 of the SQL 2017 Always-On Availability Group series, where we set up two service accounts and a security group. One account is for the database engine and the other is for the SQL agent. In order for Kerberos to work properly the database engine account must be Active Directory based. We will also be observing the rule of least privilege: the fewer privileges the accounts have, the more secure you are. So these accounts won't be given local administrator rights. You don't want your SQL service running with admin rights!

The service account that most people are familiar with is a standard Active Directory user with a complex password, dedicated to a specific application such as SQL. Active Directory service accounts are required if you want to use Kerberos authentication. However, these standard service accounts have some disadvantages. The primary issue is the lack of automatic password management. An administrator manually sets the password, and should periodically change it for security reasons. The SQL service must be restarted after the password is changed.

A few years ago Microsoft introduced "Group Managed Service Accounts". The gMSA accounts are a unique object type in AD that has automated password management and the account can be used on more than one server. You can read a bit more about gMSA accounts here. Particularly noteworthy is that the password is automatically changed every 30 days, the accounts cannot be used for interactive logins, and the account can't be locked out. 

There's a little extra configuration up front for a gMSA, but then you no longer have to worry about password changes for that application. This SQL 2017 installation series will use gMSAs so you can see how to configure them. 'Standard' AD based service accounts will work as well, but will have manual password management.

​Domain Controller Configuration

In order to use a gMSA you need at least one domain controller running Windows Server 2012 or later. If you have never used a gMSA, then it's likely you will need to run a command to create the KDS root key that enables gMSA accounts within the domain. This only needs to be run once, so skip it if your domain has already been configured. Open an elevated PowerShell prompt on a domain controller and run this command:

Add-KdsRootKey -EffectiveTime ((Get-Date).AddHours(-10))

A new root key normally takes up to 10 hours to become usable, which gives it time to replicate to all domain controllers. Backdating the effective time by 10 hours, as above, skips that wait. If you still have issues creating a gMSA, wait for replication to finish and try again.

​gMSA Account Creation

There are two primary ways to create a gMSA account. The first and most cumbersome is using various PowerShell commands, which is tedious and error prone. The second is the quick and easy method via a free GUI tool. You can download the free tool here. Install the tool on your first SQL server using all defaults.

I recommend using two gMSAs for each SQL AAG instance. One is for the database engine and the other is for the agent service. Use a logical naming standard for the accounts, and make sure the account name is 14 characters or fewer.

1. Launch the free gMSA tool and click on New. Enter the appropriate account name (14 characters or fewer) for the agent service account. Add a useful description. Select the appropriate AD container.

2. When you are prompted to assign computers to the gMSA click Yes. Add the two SQL servers to the gMSA, validating each one. After you add them and click OK, the home page of the tool should show one gMSA and the assigned computers.

​3. Repeat this procedure for the database engine gMSA account. I called this account MSA-SQL01-D. The GUI tool should now show two gMSA accounts on the home page.
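If you would rather not install the GUI tool, the PowerShell route looks roughly like the sketch below. It assumes the ActiveDirectory module is available and a KDS root key already exists; contoso.local and the SQL2017A/SQL2017B computer names are placeholders, not values from this series.

```powershell
# Computer accounts (note the trailing $) allowed to retrieve the managed password.
$Nodes = 'SQL2017A$', 'SQL2017B$'   # hypothetical SQL node names

foreach ($Account in 'MSA-SQL01-A', 'MSA-SQL01-D') {
    New-ADServiceAccount -Name $Account `
        -DNSHostName "${Account}.contoso.local" `
        -Description 'SQL 2017 AAG service account' `
        -PrincipalsAllowedToRetrieveManagedPassword $Nodes
}

# Review what was created and which computers may use each account.
Get-ADServiceAccount -Filter "Name -like 'MSA-SQL01-*'" -Properties PrincipalsAllowedToRetrieveManagedPassword
```

Add the -Path parameter to New-ADServiceAccount if you want the accounts in a specific AD container rather than the default one.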

SQL Server Preparation

1. On both of the SQL servers run the following elevated PowerShell command to enable the usage of a gMSA.

Enable-WindowsOptionalFeature -FeatureName ActiveDirectory-Powershell -Online -All

2. On both of the SQL servers run the following elevated PowerShell commands to 'install' the gMSA accounts on the server, using your service account names. Run for both of the gMSAs (database engine and agent service).

Install-AdServiceAccount MSA-SQL01-A
Install-AdServiceAccount MSA-SQL01-D

3. You should now test the service accounts via PowerShell to make sure they work. The command should return 'True' for both accounts.

Test-ADServiceAccount -Identity MSA-SQL01-A
Test-ADServiceAccount -Identity MSA-SQL01-D

​User Rights

In order to support large pages and instant file initialization we need to configure a couple of user rights in the Local Security Policy of each node. Add the database engine service account to the Lock pages in memory and Perform volume maintenance tasks user rights. Do this on both SQL VMs, and reboot.
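Windows has no built-in PowerShell cmdlet for user rights, but you can verify the result by exporting the local policy with secedit. This is only a verification sketch, run from an elevated prompt; the temp file name is arbitrary.

```powershell
# Export the user rights section of the local security policy to a temp file.
$Cfg = Join-Path $env:TEMP 'user-rights.inf'
secedit /export /cfg $Cfg /areas USER_RIGHTS | Out-Null

# SeLockMemoryPrivilege   = Lock pages in memory
# SeManageVolumePrivilege = Perform volume maintenance tasks
Select-String -Path $Cfg -Pattern 'SeLockMemoryPrivilege', 'SeManageVolumePrivilege'

Remove-Item $Cfg
```

Accounts are typically listed as SIDs, and a right with no accounts assigned will not appear in the export at all.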

Do NOT add either of the SQL service accounts to the local administrators group. This is very bad for security, and not required for proper SQL operation.

​SQL Administrator Group

​I recommend configuring an Active Directory security group that will contain all of the SQL administrators. This makes adding/removing of SQL administrators much easier. If you already have an appropriate security group in AD, use it. For this exercise I will use SQL-Admins. Add the appropriate users to this group before proceeding.
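Creating the group from PowerShell is a two-liner. The user names below are hypothetical, and you may want to add -Path to place the group in a specific OU.

```powershell
# Create a global security group for the SQL administrators.
New-ADGroup -Name 'SQL-Admins' -GroupScope Global -GroupCategory Security -Description 'SQL Server administrators'

# Add members (placeholder account names).
Add-ADGroupMember -Identity 'SQL-Admins' -Members 'jdoe', 'asmith'
```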

​Summary

In this post we created two group managed service accounts (gMSA) that we will be using for SQL 2017. These accounts require a bit more up front work to configure, but have the advantage of never needing to change the password. If you do not wish to use gMSA accounts, feel free to use the 'standard' AD user object for the service accounts. This guide is based on gMSA accounts.

​Next up in Part 4, we will install SQL 2017 on the first node. We will capture this configuration, and use it as a template for an unattended installation of SQL on the second node.

SQL 2017 Always-On AG Pt. 2: VM Deployment

First up in this series is deploying your two Windows Server 2016 VMs, which will be the SQL 2017 Always-on availability group nodes. Each VM should be configured with the same VM hardware properties. Selecting the right virtual hardware configuration is important, regardless of which hypervisor you are using. Memory, compute, networking and storage can all impact SQL 2017 performance. Remember that more is NOT always better, and in fact consuming more physical resources than you need (e.g. vCPUs) can actually hurt performance. Always take the time to 'right size' your VMs.

​Memory

Memory configuration is very dependent on the SQL workload and your applications. In order for SQL Server to use large pages, which can boost performance, the VM needs at least 8GB of RAM. In general I don't reserve VM memory at the hypervisor level, but for tier-1 SQL VMs I would reserve the memory. This is an additional step in VMware environments, and is automatic with AHV as it doesn't overcommit VM memory as of the date of this post.

Database workloads love memory, as it is vastly faster to access data in memory than on disk. So you should not skimp on VM memory allocation, but neither should you over provision. Mentally reserve 2-4GB of VM memory for the OS, and then add on top of that how much memory you want to dedicate to SQL. Later in this series we will optimally configure SQL memory usage by adjusting SQL runtime parameters.

If your hypervisor supports hot-plug of memory, you can enable that feature without negatively impacting performance. However, remember that if you do increase the VM memory, the SQL memory runtime parameter we cover later in this series needs adjustment as well, or the additional RAM will not be used by SQL. This requires a restart of the SQL instance.

Keep in mind that Microsoft SQL is NUMA (non-uniform memory access) aware, so there's not a major performance hit if you configure the VM for more memory than is in one NUMA node. But if you can keep VM memory allocation within a single NUMA node, that will maximize VM memory performance.
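As a rough illustration of the "reserve 2-4GB for the OS" rule of thumb, the arithmetic can be sketched inside the guest like this. The 4096 MB reserve is simply the upper end of that range and is an assumption; the result is a starting point for sizing, not a tuned value.

```powershell
# Total memory visible to the guest OS, in MB.
$TotalMB = [math]::Floor((Get-CimInstance -ClassName Win32_ComputerSystem).TotalPhysicalMemory / 1MB)

# Assumed OS reserve (upper end of the 2-4GB rule of thumb).
$OsReserveMB = 4096

# What is left over is a candidate amount to dedicate to SQL.
$SqlMB = $TotalMB - $OsReserveMB
"Total VM memory: $TotalMB MB; candidate SQL allocation: $SqlMB MB"
```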

​vCPUs

vCPU allocation is very dependent on the expected workload. If you allocate more vCPUs than exist on a NUMA node, performance may be negatively impacted. For example, if your server is dual socket and each socket has 8 physical cores, try to allocate 8 or fewer vCPUs to the VM.

Understanding NUMA can be a bit complicated, as different hypervisor versions and vendors handle it differently.  NUMA configuration can also potentially impact SQL licensing, as the guest OS ​will see a differing number of sockets and CPUs depending on how the hypervisor presents the NUMA topology to the VM. 
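To see what socket and core layout the guest OS actually received from the hypervisor, a quick check from inside the VM is sketched below. It returns one row per virtual socket.

```powershell
# Show the virtual sockets and cores as presented to Windows.
Get-CimInstance -ClassName Win32_Processor |
    Select-Object SocketDesignation, NumberOfCores, NumberOfLogicalProcessors
```

Comparing this output before and after a vCPU change is an easy way to confirm how your hypervisor is presenting the topology.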

If you are using VMware vSphere, check out this blog post to get a good understanding of how they handle vNUMA in various vSphere versions. If you are using Nutanix AHV, check the latest version of the AHV Administration Guide on the support portal for current guidance and configuration steps. Tip: Search on the Nutanix portal for "vNUMA".

If you are using VMware vSphere, take note that enabling CPU hot add will disable vNUMA. So for all SQL VMs, I would NOT configure CPU-hot add. Schedule VM downtime to manually reconfigure the vCPUs and reboot the guest OS. 

2018 has seen a number of CPU security issues arise, such as Spectre, Meltdown, and L1TF. Mitigations for these security issues can negatively impact CPU performance and even reduce the total number of usable logical CPU threads in a physical server. Make sure to take these mitigations into consideration when sizing your SQL VMs.

​Networking

Networking configuration is very straightforward. For Nutanix AHV there's only one virtual NIC type, so you can't go wrong. Just add a NIC to the VM, and you are set. If you are using VMware vSphere, use the VMXNET3 NIC. A single vNIC per VM is sufficient. Be sure to use the latest VMware tools or Nutanix AHV NIC drivers.

​Storage

Properly configuring storage for SQL is a bit more complicated than networking. Configuration recommendations will differ depending on both the hypervisor you are using and the underlying storage platform (e.g. SAN, Nutanix, etc.). If your storage vendor has recommendations for optimal SQL disk configuration, follow them.

If you are using Nutanix AHV, use the "SCSI" disk type for all of the VM disks. If you are using VMware vSphere, I recommend using the default LSI SAS controller for the OS boot disk and the PVSCSI controller for all other disks. You can use the PVSCSI controller for the boot disk as well, but that disk should not see very many IOPS, and installing Windows on a PVSCSI disk requires additional manual steps. See my blog post here for injecting the VMware drivers into your Windows image.

As a starting point for disk configuration, the following is a good baseline for a basic SQL server using a generic storage back-end. You will very likely have a different configuration, but consider the disk layout below. Many times I will use mount points so you don't run out of drive letters, but for this example let's keep it simple. I also recommend unique drive sizes (even if just a 1GB difference) so you can easily match up hypervisor disks and in-guest disks. The drive letters below are what I'll be using through the rest of this series.

C: 40GB - OS Drive
D: 20GB - SQL Binaries
G: SQL Data Files 1
H: TempDB data 1
I: TempDB data 2
J: TempDB Logs
K: SQL DB Logs
L: SQL Backups

The Windows boot drive for SQL does not need to be huge. SQL binaries, databases and logs should all be stored on disks other than the C drive. So the boot disk should just be large enough for the OS, future patches/upgrades, swap space, etc. 40GB seems to be a fair number. Adjust as you see fit. All of the other disks ​are extremely environment specific and should be tailored to your requirements for both size and quantity.

​As mentioned above a production quality SQL server should have a number of guest disks ​for optimal performance. In vSphere environments all of the SQL-specific disks (e.g. data files, log files, temp DB, backups, etc.) should use the PVSCSI controller. However, for best performance you should add 3 PVSCSI controllers to the SQL VM and then evenly distribute the SQL disks over those three virtual controllers. If you are using Nutanix AHV no virtual controller card configuration is required as the SCSI disks are optimally configured under the covers. However, it is advised that you use the latest Nutanix AHV VirtIO drivers for best performance. 

If you are using Nutanix-based storage, regardless of hypervisor choice, the proper disk layout is needed for optimal performance. On Nutanix it is important to spread the SQL database files and TempDB files across multiple disks. For example, if you have a database that needs medium to high performance I/O it could be beneficial to spread the database files over 4 or more virtual disks. If the DB is 1TB in size, then configure it for 4x 256GB files on four separate disks as a starting point. The same goes for TempDB: using multiple TempDB files on several disks could increase performance. See the Nutanix SQL Best Practices Guide for more details.

The C and D (SQL binaries) drives should use the default NTFS allocation unit size. However, the remaining SQL disks should use a 64K allocation unit size. Many moons ago partition alignment was an issue, but modern OSes are smarter and manual configuration is no longer needed. In virtual environments you have a choice of disk provisioning type (e.g. thin, thick, EZT, etc.). In nearly all cases thin provisioning is the preferred disk format, particularly on Nutanix storage. If your storage vendor has a different recommendation, follow it.
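A quick way to check or set the 64K allocation unit size from PowerShell is sketched below. The drive letter G matches the layout used in this series, but the volume label is a placeholder. Formatting erases the volume, so the command is shown with -WhatIf, which only reports what would happen.

```powershell
# Check the allocation unit size (BlockSize, in bytes) of local volumes; 65536 = 64K.
Get-CimInstance -ClassName Win32_Volume -Filter "DriveType = 3" |
    Select-Object DriveLetter, Label, BlockSize

# WARNING: formatting destroys all data on the volume. Use only on a new, empty SQL disk.
# Remove -WhatIf to actually perform the format.
Format-Volume -DriveLetter G -FileSystem NTFS -AllocationUnitSize 65536 -NewFileSystemLabel 'SQLData1' -WhatIf
```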

Finally, if you are using the Nutanix platform, follow the most current Nutanix SQL Best Practices guide for optimal container (datastore) configuration in terms of data reduction services such as compression, de-dupe, and EC-X. Recommendations can change over time as the Nutanix platform evolves.

Pro Tip: When creating the SQL VMs add the boot disk first, so that it has the lowest SCSI ID. This will enable Windows to put all of the needed partitions on your boot disk. Otherwise you may lose 500MB on another disk and create future headaches if you want to make disk changes down the road.

​Summary

​Optimizing the VM hardware configuration is very important for peak performance. VM configuration will vary slightly based on hypervisor and underlying storage platform. I strongly recommend creating a standard SQL VM configuration baseline, and using it as a template for all other SQL VMs. Then adjust CPU/Memory/Disks as needed for a particular instance. For VMware environments you can easily use VM templates for this purpose. Also remember to use the latest VM hardware version, if you have a choice. 

Once you have installed Windows Server 2016 on two VMs, configured all of the drives, patched, and joined to the domain, then proceed to Part 3. Also make sure you have the SQL 2017 ISO handy and mounted to each VM. If you don't have another Windows VM in your environment that can host a simple and tiny file share (for the AAG file share witness) then deploy a third Windows Server 2016 VM with a single disk.

SQL 2017 Always-On AG Pt. 1: Introduction

​In this series of posts I will show you how to configure SQL 2017 on Windows Server 2016 with Always-on Availability Groups. I will also be pointing out Nutanix AHV and VMware vSphere configuration best practices along the way. As you will see, 98% of the configuration is hypervisor independent.

SQL Always-on Availability Groups is no longer "new", and is now the preferred HA option for SQL (assuming your applications support it). Gone are the days of shared SCSI disks/LUNs! SQL AAGs were introduced with SQL 2012, and have been enhanced to various degrees with each release. For "full feature" AAGs, you must use the Enterprise edition of SQL. "Basic" availability groups are available in SQL 2016 and 2017 Standard edition, but are quite feature limited. You can read about basic AAGs here, directly from Microsoft. This installation series will only cover Enterprise edition AAGs. A few years ago I wrote a similar set of blog posts for SQL 2014, which you can find here.

Most virtualization or general IT administrators do not have deep DBA skills, so this series is aimed at the general IT administrator and makes general configuration recommendations along the way. Each SQL instance is unique, and will likely deviate from generic settings in this series. So just take the sizings in this series as an example, and adjust accordingly for your needs. CPU, disks, memory and such are all workload and application dependent. Follow your application best practice recommendations, when available.

As previously mentioned, the vast majority of the settings in this guide are hypervisor and storage platform independent. If, however, you are a Nutanix customer I strongly suggest you download our SQL Best Practices Guide, of which I was a contributing author. If you are using the VMware platform, I also recommend buying the book "Virtualizing SQL Server with VMware." It has a plethora of advanced SQL advice, which is applicable to any hypervisor. So even if you aren't using vSphere, the book has valuable configuration and performance tuning advice that pairs nicely with Nutanix AHV or Microsoft Hyper-V.

​Today virtualizing SQL is no longer considered unusual, and in fact is one of the most popular applications to virtualize. However, that doesn't mean you can just "click next" or use all the defaults and expect tier-1 performance or availability. This series will give you a good starting point for your enterprise SQL 2017 deployments. 

​What to Expect

​This is not a "click next" installation. Topics covered in this series include:

  • vSphere and Nutanix AHV configuration best practices
  • Windows firewall configuration
  • SQL SSL certificates (optional)
  • Configuring e-mail alerts and notifications
  • Implementing maintenance and backup plans


Nutanix NPX Architecture Guide How-To (Part 2)

This post is part 2 of 2, of my NPX Architecture Guide how-to series. In this post we will cover sections 9 through 14, from the outline below. You can check out the first part of this series here. At the end I will also give you some more tips on various standard tables that I used throughout the document. 

The major sections in the architecture guide are:

  1. Overview
  2. Current State and Operational Assessment
  3. Design Overview
  4. Nutanix Capacity and Sizing
  5. Nutanix Cluster Design
  6. Host Design
  7. Network Design
  8. Storage Design
  9. Security and Compliance
  10. Management Components
  11. Virtual Machine Design
  12. Data Protection and Recoverability
  13. Datacenter Infrastructure
  14. Third-Party Integration

9.0 Security and Compliance

NPX architecture guide security and compliance

Security is often a weak point for many architects, so be sure to not skimp on this section. If it's light on details, be prepared for defense questions. Topics to cover here include, but are not limited to: RBAC design (Prism/hypervisor/applications); SSL certificates (Prism/NGT/Hypervisor); system hardening (STIG, PCI/DSS, etc.); network security (microsegmentation, VLANs, ACLs, etc.); patching/compliance reporting; use of SSH and hardening (e.g. SSH keys); syslog configuration (TCP/UDP); PulseHD; use of two factor authentication; Nutanix password complexity settings. 

And depending on your technical and business requirements, there very well could be additional security areas you need to cover. Have you read the Nutanix security guide? Make sure every "i" is dotted and every "t" crossed. 

10.0 Management Components

NPX architecture management components

The control plane for your solution is very important. Don't make management overly complex, as one of the beauties of a Nutanix solution is simplicity. Things to consider here include: Prism configuration; Nutanix patches and AOS upgrades; how you will monitor Nutanix; OS patching; hypervisor patching; what tools you are using to monitor the network; how you will monitor the hypervisor; and, since you of course have VMs, what tools are monitoring them.

In the management arena don't forget about advanced automation tools such as Puppet, Chef, PowerShell, and Nutanix Calm. What are you using for Syslog? Splunk? Are you using Prism Central or Prism Pro? If so, why, or if not, why not? 

11.0 Virtual Machine Design

NPX Architecture Virtual Machine Design

As called out in the NPX blueprint, you must include your virtual machine design. What is that? Well it should cover topics such as VM templates, VM virtual hardware, are you making use of SCSI unmap or not? And what are the implications of using or not using unmap? What's the difference between Linux/Windows unmap? How about VM affinity or anti-affinity rules? What is the lifecycle of your VMs (cradle to grave)? Do you have monster VMs? What is your NUMA boundary and do any VMs cross it? What are the implications of NUMA? 

12.0 Data Protection and Recoverability

NPX architecture data protection and recoverability

What good is your solution if it's not protecting your business critical data and ensuring that you can recover from a disaster? Here think about covering: backup software (NetBackup, Veeam, HYCU, etc.); Nutanix data protection (protection domains, replication, snapshot frequency, sync/async replication, etc.); network configuration protection (e.g. nightly switch config backups); storage protection; hypervisor control plane backups; how to protect critical infrastructure services like AD/DNS; what is your VM backup frequency?; what are your operational procedures for areas such as change control, patch management, and business continuity?

13.0 Datacenter Infrastructure

NPX architecture datacenter infrastructure

To many people the datacenter infrastructure can be a scary topic, and it is often only lightly covered in the architecture guide. But it's critical. What if you miscalculate (or don't calculate at all) the heat load for your proposed solution and it melts down in the rack or causes a fire? What does your rack elevation look like? Did you allow space in the rack for logically placing new nodes? What is the datacenter rating for the maximum heat load of an individual rack? What types of PDUs are you using and how many? Did you adjust your estimated amps usage for the power factor? Are you running your PDUs at more than 80%, sustained, during a failure situation? What is your datacenter facility rated for in terms of downtime a year? And is that downtime planned or unplanned? How many more nodes can you add to the rack before you exceed the rated limits (cooling, power, or weight)? Is your solution going to fall through the floor because you didn't validate your assumption about maximum load rating (did you even ask?)?
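To make the power questions above a little more concrete, here is a rough Python sketch of the arithmetic. It is only an illustration under my own assumptions: single-phase power (three-phase circuits need a square-root-of-3 term that is not shown), redundant PDUs where the survivors carry the full load after one fails, and made-up wattage, voltage, and PDU ratings. The function names are hypothetical too. Plug in the figures from your own hardware spec sheets and facility.

```python
import math


def amps_from_watts(watts, volts, power_factor):
    """Single-phase current draw in amps: I = P / (V * PF)."""
    if not 0 < power_factor <= 1:
        raise ValueError("power factor must be in (0, 1]")
    return watts / (volts * power_factor)


def pdu_failure_check(node_watts, node_count, volts, power_factor,
                      pdu_rated_amps, pdu_count=2, sustained_limit=0.80):
    """Compare the load on each surviving PDU, after one PDU fails,
    against the sustained limit (80% of the PDU rating by default)."""
    if pdu_count < 2:
        raise ValueError("need at least two PDUs to survive a PDU failure")
    total_amps = amps_from_watts(node_watts * node_count, volts, power_factor)
    amps_per_survivor = total_amps / (pdu_count - 1)
    allowed_amps = pdu_rated_amps * sustained_limit
    return {
        "total_amps": total_amps,
        "amps_per_surviving_pdu": amps_per_survivor,
        "allowed_amps": allowed_amps,
        "within_limit": amps_per_survivor <= allowed_amps,
    }


def max_nodes(node_watts, volts, power_factor, pdu_rated_amps,
              pdu_count=2, sustained_limit=0.80):
    """Largest node count that stays within the sustained limit
    with one PDU failed (power only; cooling and weight not covered)."""
    allowed_watts = (pdu_rated_amps * sustained_limit * (pdu_count - 1)
                     * volts * power_factor)
    return math.floor(allowed_watts / node_watts)


if __name__ == "__main__":
    # Hypothetical rack: 8 nodes at 500 W each, 208 V, PF 0.95, two 30 A PDUs.
    print(pdu_failure_check(node_watts=500, node_count=8, volts=208,
                            power_factor=0.95, pdu_rated_amps=30))
    print(max_nodes(node_watts=500, volts=208, power_factor=0.95,
                    pdu_rated_amps=30))
```

The same idea applies to cooling and weight: capture the rated limit, capture your per-node number, and document the headroom in the guide.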

14.0 Third-Party Integration

NPX architecture third-party integration

Another scoring area for the NPX is your ability to cover third-party integrations. What does that mean? For the NPX, that's any non-Nutanix product which you include in your solution. I recommend a separate section, even if you have touched on these solutions throughout your guide. Why? It makes finding them easier, and your panelists will like that. The areas you will cover here are highly solution dependent, so you may have fewer, more, and likely different products to cover than I did. My solution used Splunk, NetBackup, VMware vSphere, and also VDI.

Sample Design Decision Table

NPX architecture design decision table

Throughout your architecture guide you absolutely must thoroughly document your major design decisions. How many design decisions you will have totally depends on your solution, and how thorough you want to be. In my case I had 60 design decisions, and each one was captured using the template above. Then I placed each full design decision table in the appropriate major section of the guide (e.g. networking, security, etc.). At the front of my architecture guide I had another table, consisting of one line per design decision, for easy reference.

Now this design decision table is not "perfect" and in fact I would argue it needs supplementation. I could, and should, have done it better. But first let's start with what's in the table, and then cover what I left out which I think you should have in it. First, you need to label and capture a one sentence description of the decision. For example, are you using the VMware standard switch or distributed switch? Next, every single decision has an impact. What is that impact? Describe it. Nearly every decision has a risk....what is it? And every risk needs a mitigation, so what is that?

Now, what do I think you should include that I didn't? "Alternative design decisions". Why do I think you need alternative design decisions for nearly EVERY decision? Because you will likely be asked about it during your defense. For example, let's say Design Decision 40 was to use LACP. Ok that's fine and dandy, but what are the alternative(s) to using LACP and why didn't you use them? Or, what if you chose the NX-3060-G6 node for your baseline node type. What would be an alternative node type that could also work? These are EXACTLY the types of questions you need to be prepared for during your live defense. By thinking about them beforehand AND documenting them in your guide, you are that much closer to successfully defending your NPX design.

So yes, IMHO, I think every single design decision you have should be documented with: Impact, risks, risk mitigation and alternatives. 
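If you keep your design decisions in a structured form before pasting them into the guide, a few lines of Python can flag any decision that is missing one of those parts. This is just a sketch of the idea: the `DesignDecision` class and its field names are my own hypothetical layout based on the list above, not an official NPX template, and the sample text is a placeholder.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DesignDecision:
    """One design decision record (hypothetical layout)."""
    label: str                  # e.g. "DD40"
    description: str            # one sentence description of the decision
    impact: str = ""
    risk: str = ""
    mitigation: str = ""
    alternatives: list = field(default_factory=list)

    def missing_parts(self):
        """Return the names of any fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


if __name__ == "__main__":
    # Placeholder content; the alternatives have not been documented yet.
    dd40 = DesignDecision(
        label="DD40",
        description="Use LACP",
        impact="placeholder impact text",
        risk="placeholder risk text",
        mitigation="placeholder mitigation text",
    )
    print(dd40.label, "is missing:", dd40.missing_parts())
```

Run something like this over all of your decisions as part of your final QA, and fill in whatever it reports before you submit.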

Sample Assumptions Table

NPX architecture assumptions

You might be thinking, what's so special about an assumption table? Don't you just capture all of your assumptions and call it a day? NOPE! Epic fail! For each assumption you need to validate, if at all possible, your assumption. Document how you will validate it. 

Sample Summary Table

NPX architecture design summary table

Now this table I think is optional, but I included it for both my VCDX and NPX designs. At the end of each major tech section (e.g. 4.0 - 14.0) I have a section called "Summary and Design Decisions". The summary is the table above, which captures and quickly displays all of the referenced requirements, constraints, assumptions, risks, and design qualities covered in that major section. I think of the table as a 'double checking' that I've covered all of the requirements, constraints, assumptions and risks applicable to that major section. Is this table required? Nope. Do I like it? Yup. Should you use it? Totally up to you.

Additional Architecture Guide Tips

One of the hallmarks of an "X" level (VCDX/NPX) architecture guide is traceability. What is that? It means every labeled item (requirements, constraints, assumptions, risks, design decisions) needs to be called out at least ONCE in the main body of your document. NPX examiners DO use the search functionality quite often to see, for example, if risk RS05 is actually addressed in your document. As you write your guide, and as a final QA, take an afternoon and search for every single labeled item and MAKE DARN SURE it's referenced in the body of your guide.
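If you can export your guide to plain text, that final QA search can be scripted. Below is a minimal Python sketch, assuming you maintain your own list of labels and that the text you pass in is the body of the guide without the tables where the labels are first defined. The function name and sample labels are hypothetical, and getting text out of Word or PDF is left to you.

```python
import re


def find_unreferenced(labels, body_text):
    """Return the labels that never appear as a whole token in body_text."""
    missing = []
    for label in labels:
        # Lookarounds stop "R1" from matching inside "R10" or "XR1".
        pattern = r"(?<![A-Za-z0-9])" + re.escape(label) + r"(?![A-Za-z0-9])"
        if not re.search(pattern, body_text):
            missing.append(label)
    return missing


if __name__ == "__main__":
    labels = ["R1", "R10", "RS05", "DD40"]
    body = "The uplink design in DD40 satisfies requirement R10."
    print("Not referenced in the body:", find_unreferenced(labels, body))
```

A script like this only tells you a label is mentioned somewhere. You still need to read the surrounding paragraph to confirm the item is actually addressed.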

Another tip that I find exceptionally helpful for starting a new architecture guide is this: First construct the outline of all the areas in the NPX (or VCDX) blueprint as major sections (e.g. network design, storage design, security and compliance, etc.). Then under each major heading, just like I have, construct your sub-headings of conceptual, logical and physical. Then at the end of each major section have a design justification summary, and then your summary table and design decision tables. After you do all of this 'pre' work, you will have a nicely outlined guide that you can now start filling in with details. Easy, right?

Applicable to VCDX?

So you may be thinking, well thanks for all of the tips for a successful NPX architecture guide, but does this apply to the VMware VCDX certification or other enterprise architect certs? And the answer is ABSOLUTELY! In fact, I used the *exact* same format for my VCDX certification and it was accepted and successfully defended on my first attempt. If it's good enough for NPX/VCDX, is it good enough for customer facing docs? The answer here is also a resounding yes.

The tips I've provided here are for an enterprise level architecture guide, and "X" level certifications like VCDX and NPX are very similar in the skillset they attempt to assess. So can you take your NPX architecture guide, if based on vSphere, and submit it for VCDX? With only minor modifications to ensure you cover VCDX blueprint areas, the answer is yes! I did the reverse....started out with a VCDX-level design, added Nutanix blueprint areas, and submitted it for the NPX.

Conclusion

As you can see through these two long blog posts, an "X" (expert) level architecture guide can be a monster. It covers a lot of areas, needs full traceability, must be concise and easy to read, organized so that the examiners can easily score it, and also be technically accurate. It also needs enough depth to be considered "X" (expert) level. Although I will emphasize that there's no minimum page count, I would find it very hard to believe that something as short as, for example, 30 pages for an enterprise architecture guide would pass muster.

If you follow my advice in these two posts, it should get you well on your way to having a well organized, detailed, and easy to read/score architecture guide for your NPX (or VCDX) defense. 

 

Nutanix NPX Architecture Guide How-To (Part 1)

A couple of months ago I successfully defended my Nutanix Platform Expert (NPX) certification and became number 14 in the world to obtain the certification. You can read all about that journey here. As part of the NPX certification process you submit a documentation package which should cover all of the areas in the NPX blueprint. This package will consist of multiple documents, but it's entirely up to the author on how to organize and present the content called out in the blueprint. This post is part 1 of a 2 part series, covering how my NPX architecture guide was organized. 

There is no "NPX" document template or "magical" format that will guarantee acceptance of your work, and enable you to do the in-person live defense. So please don't just copy this outline as-is, throw in a few sentences under each topic, and think you are good to go. Just use these two posts as inspiration for your NPX submission, and help ensure you cover all blueprint areas. 

Straight from the NPX Design Review blueprint your documentation package must include the following content, or the submission will be rejected:

  • A current state and operational readiness assessment
  • A web-scale migration and transition plan
  • Documentation of specific business requirements driving the solution design
  • Documentation of assumptions that impacted the solution design
  • Documentation of the design constraints that impacted the design and delivery of the solution
  • Documentation describing risks identified in the design and delivery of the solution and how those risks were remediated
  • A solution architecture including conceptual/logical and physical design with appropriate diagrams and descriptions of all functional components of the solution
  • An implementation plan
  • An installation guide
  • A test and validation plan
  • Documentation of operational procedures

And also directly from the NPX blueprint, the following categories will be judged:

Conceptual/Logical Design Elements

  • Scalability
  • Resiliency
  • Performance
  • Manageability and control plane architecture
  • Data protection and recoverability
  • Compliance and security
  • Virtual machine design logical design
  • Virtual network design
  • Third-party solution integration

Physical Design Elements

  • Resource sizing
  • Storage infrastructure
  • Platform selection
  • Networking infrastructure
  • Virtual machine physical design
  • Management component design
  • Datacenter infrastructure (Environmental and power)

As you can see, the NPX blueprint covers a lot of ground. Although page count is NOT specified, and longer is NOT always better, typical submissions can exceed 200 total pages (spread across multiple documents).

Where to Start

One of the first tasks when you are starting down the NPX path is to plan out your documentation, and decide what NPX blueprint content will be in which document. Again, there's no hard and fast rule here. And as an "X" (expert) level architect, you should have a good idea how to do this. Logical organization is KEY to allowing the NPX panelists to quickly and properly evaluate your documentation. If it's hard to find the blueprint areas for scoring purposes, you are not doing yourself any favors. Make it dead easy to find each and every required documentation criterion.

For my joint submission with my NPX partner Bruno Sousa, we decided on the following physical documents (in no particular order):

  • Completed NPX application PDF form
  • Resume
  • DevOps essay
  • Architecture Design Guide
  • Implementation Plan
  • Installation Guide
  • Operational Verification
  • Operations Guide

In this blog post I will focus on the Architecture Design Guide, as that is where the majority of the content lives. That's not to discount all of the other docs, but time wise, I found myself spending the most on the Architecture Guide.

NPX Architecture Guide

As you can see from the NPX blueprint, you are required to have conceptual, logical and physical elements for many design areas (virtual machines, networking, etc.). A natural progression from conceptual to logical to physical in your documentation makes following your thought process easy. As you will see from my documentation outline, for most areas I made specific headings called "Conceptual Design", "Logical Design" and "Physical Design". That makes it super obvious that 1) you've covered the areas in the NPX blueprint, 2) you know the difference between each, and 3) the reader can logically follow your thought process. A simple but key tip. I've seen more than one NPX submission that made it very difficult to follow the author's thought process.

With all of that being said, now let's dive into the actual outline of my NPX Architecture Guide so you can see how it was organized. Again, this is not the magical outline, and you can diverge from this to suit your style and design. This is just how Bruno and I organized the document.

The major sections in our architecture guide are:

  1. Overview
  2. Current State and Operational Assessment
  3. Design Overview
  4. Nutanix Capacity and Sizing
  5. Nutanix Cluster Design
  6. Host Design
  7. Network Design
  8. Storage Design
  9. Security and Compliance
  10. Management Components
  11. Virtual Machine Design
  12. Data Protection and Recoverability
  13. Datacenter Infrastructure
  14. Third-Party Integration

The remainder of this blog post will touch on highlights from each area, and have screenshots of the actual outline from our submission.

1.0 Overview

NPX Overview

The overview is a very brief 3-4 page description of the entire solution, at a 30,000 foot level. Why was this project needed? What are the roles and responsibilities of all parties involved (hint: use a RACI chart)? What was the customer project sign off process?

2.0 Current State and Operational Assessment


NPX current state and operational assessment

If you are in a brownfield environment and are doing any type of upgrades, migration, etc. you will need a current state and operational assessment section. As you can see from the outline above, it needs to be very comprehensive. For example, I included the following:

  • Performance baseline (storage, compute) - Charts, graphs, and IOPS/bandwidth measurements
  • Full VM inventory (OSes, largest VM, high performance VM metrics, etc.)
  • Full operational readiness assessment
  • Gap analysis

Giving the reader a good picture of the current state environment is key, as the remainder of the document will build upon this foundation. Capturing performance metrics is key, so that you know how to properly size the new environment, and then validate the new environment can support the projected workload.

3.0 Design Overview

NPX design overview

The design overview is massively important, as this section captures requirements, constraints, assumptions, risks, and design decisions. And it also has 10,000 foot conceptual, logical and physical diagrams of the proposed solution. 

Each requirement, constraint, assumption, risk and design decision should have a unique reference number, which will be used throughout your entire documentation package. Tip: If you have requirement R10 (for example) in the table, but don't reference it anywhere in your doc package, that's a big problem. Validate each and every item that has a unique identifier is used at least once elsewhere.

The screenshot below is a small sample of what my requirements table looked like. The number of requirements will vary greatly from design to design. In my case I had 22, but you may have dozens more if the solution is complex. 

I have seen candidates break out 'technical requirements' (TR) and 'business requirements' (BR) into separate tables. That's certainly a valid approach, and makes perfect sense. I combined all of mine in one table. 

NPX requirements

For your risks table, it's not adequate to just list the risk. You must also have a mitigation for each and every risk. 

4.0 Nutanix Capacity and Sizing

NPX sizing and capacity planning

As previously covered, you can see here that I used specific headings of "Conceptual Design", "Logical Design" and "Physical Design" for the sizing and capacity section. This forces you to think through and present your solution logically.

For each logical sizing unit (compute, memory, storage) I had a table similar to the following, which clearly shows my assumptions and how I arrived at the logical sizing unit. This logical sizing unit is then used later for the physical sizing of the cluster. 

NPX server virtualization CPU logical sizing

5.0 Nutanix Cluster Design

NPX Nutanix cluster design

I'll get tired of saying this, but for the Nutanix Cluster Design I also followed the conceptual, logical and physical flow. Key areas to cover here are scalability and resiliency, in addition to all of the physical components. 

6.0 Host Design

NPX host design

Host design covers the compute design plus the hypervisor of your choice. And again, I used the progression of conceptual, logical, and physical to enable the reader to understand my thought process. 

7.0 Network Design

NPX network design

The networking section is pretty self-explanatory. You need to have sufficient depth here that you convey your "X" level knowledge of networking. For example, are you using ECMP? Why or why not? Where in the network is routing taking place? What routing protocol? Are you using leaf/spine or a 3-tier design? Microsegmentation? Any SDN solutions? LACP? NIC teaming? NIOC? How many network ports does your solution require? How many free ports are there for future expansion? How would the network scale out as more nodes are added? What does your network security look like?

Networking is often a weak point for architects, so if that is your situation, I suggest seeking out experts to help with your design. For example, do you know any CCIEs? Or is there a networking best practices author within your organization? If you just brush over key network details in your documentation, don't be surprised if during your defense you get quizzed more. So be prepared! 

8.0 Storage Design

NPX storage design

Just like networking, the storage section is pretty straightforward. Conceptual, logical and physical headings make another appearance. Be sure to cover all Nutanix storage details here, such as compression, dedupe, EC-X, data locality, shadow clones (if used), RF-level, etc.

Summary

As you can see from the first eight sections of the NPX Architecture Guide, there are a ton of details that you need to cover. It took me over 96 pages to cover these eight sections in what I thought was sufficient "X"-level detail. In Part 2 of this series, I will cover sections 9 - 14, and give you more tips about what I included in each section.

My Journey to Nutanix Platform Expert (NPX) #014

Almost four years ago to the month my career took a major turn. I had just successfully passed the VMware VCDX datacenter virtualization certification, becoming #125 in the world. You can read about my VCP5 to VCDX journey in 180 days here. And the following week I joined this somewhat little known and scrappy startup, called Nutanix.

Back then HCI (hyperconverged infrastructure) was a newfangled technology that many, many people were quite skeptical of. Was it enterprise ready? Was it good for anything more than VDI? Could it run SQL, Oracle, Exchange? Can it compete toe-to-toe with vBlock? Would Nutanix go belly up or be acquired? Betting my career on HCI was risky in 2014, but it's paid off in more ways than I could ever imagine.

Having the VCDX certification under my belt really prepared me well for my dual role at Nutanix as both a Solutions Architect in engineering and a Consulting Architect in our sales organization. As a Solutions Architect I wrote a number of published customer facing Nutanix Best Practice Guides, such as SQL, Veeam, Lync, Microsoft DFS, and others. And as a field Consulting Architect I worked with dozens of customers over the years on projects of all sizes and shapes. Both roles helped refine my enterprise IT architecture skills, and gave me hands-on experience with our own products including AHV (Nutanix's hypervisor).

NPX (Nutanix Platform Expert)

Just about a year into my career at Nutanix in March 2015, Nutanix announced the NPX certification. You can read my blog post about it here. I was honored to be part of the team that helped develop NPX and came up with the criteria for what it means to be NPX. The bar that was set was even higher than other defense based certifications, such as VCDX. Why? You have to know two hypervisors at the "X" level as well as demonstrating enterprise grade IT architect skills. Our MQC (minimally qualified candidate) bar is high, and the first time pass rate is far from 100%.

Now you may wonder why it's now 2018, three years after NPX 'went live' and I am just now defending. Well to be frank, having dual roles in Nutanix for over 3 years left little to zero time in my life to do blogging or spend the hundreds and hundreds of hours it takes to prepare for the NPX. For my VCDX I estimated I spent over 1,000 hours of preparation and 250+ pages of documentation. So I knew NPX would be even harder.

I am a competitive person, and I also like proving to myself that I can be on a similar footing as my colleagues, for whom I have immense respect. They were getting their NPXs and they kept badgering me to get mine. Plus, I want to be the best customer facing consultant that I can be, and I knew doing NPX would take my VCDX skills to an even higher level. I also very recently shifted roles a bit within Nutanix to focus on our largest global accounts. The job description for that role requires NPX-level skills. Immense pressure was building on me to successfully defend NPX.

The NPX Design

In early 2017 I decided to start putting time into my NPX preparation. NPX requires a real-world design that you've done, so I thought there was no better choice than taking my UCS/HP 3PAR VCDX design and migrating it to a Nutanix-based solution. So I dredged up all my VCDX documentation from 3 years ago, and read it over. I was shocked to remember how complex 3-tier solutions are, and in particular the SAN/RAID/LUN configuration.

Going through my VCDX design I was ripping out page after page of complexity. LUNs? Gone. SAN? Gone. Fibre Channel switches? Gone. Boot-from-SAN? Gone. Cisco service profiles? Gone. You get the idea. And the best part about it? I was actively involved in the account behind the actual environment my VCDX was based on, migrating it almost entirely to Nutanix. So my NPX had a dual purpose of both defending, and transforming a real Nutanix customer from 3-tier to Nutanix simplicity. Win-Win!

NPX Preparation

For anyone starting down the NPX path, the freely available NPX Blueprint is your Bible. It has all the topics you need to cover to properly submit and successfully defend for NPX. To get a copy, email npx@nutanix.com. It is absolutely critical that you follow the Blueprint to the letter and cover everything, including all of the required documents. Although all of the documents are important, to me the Architecture Guide is where you will spend the majority of your time. My VCDX Architecture guide was 185 pages, and my NPX version was 134 pages. That's about 50 pages fewer, nearly all due to removing complexity, while covering more topics for the same environment.

After you get all of your documentation in order, next comes submission time. The NPX application is quite detailed and requires things such as a resume, 3 professional references, a web-scale essay, plus all of the documents you've spent probably 6-9 months working on. After submission your documentation is scored, and if it scores high enough, you are then invited to an in-person defense. Submission time is roughly 3 weeks prior to the published defense dates.

If accepted, now is the time to start working on your PowerPoint slide deck for your defense. You will use this slide deck to walk the panelists through your 90 minute defense, where you will be asked questions about your design, alternatives, and why you did what you did.

Pro Tip: Take all the blueprint topics and create one slide for each topic. Fill each slide with what you think are the top items to cover. Even if you don't verbally cover all bullets on the slide, have the content there so the panelists can ask questions.  I had approximately 23 content slides in my presentation, plus a number of indexed backup slides. 

I've included my TOC for my NPX deck below. This is not a magical slide...it's all directly from the blueprint. This is just one way to do it...do what feels right to you. 

​Now that your slide deck is ready, you need to mock mock mock! Don't use a potted plant to talk to...use social media and your contacts to find other NPXs or people working on their NPX. Do a webex, Zoom, etc. Practice practice! Heck, if your design is based on vSphere, hit up VCDXs.

But don't forget to mock the troubleshooting and design scenarios. Those two areas are also key for scoring, so don't just wing it during your defense. Aim for multiple mocks for each of the three areas: defense, troubleshooting, design scenario.

My personal goal was to get through the slide presentation, uninterrupted, in 30 minutes. That leaves 60 minutes for panelists to ask questions. YMMV, but I'd advise not going much longer or you jeopardize your scoring chances.

​Dooms (I mean Defense) Day

​By now you should be comfortable with your design, mocked each of the three major sections of the defense, and probably didn't sleep too well the night before. But be rested! Also if you are traveling across time zones, try to arrive a couple of days early to help adjust. You don't want to be a jetlag zombie during your defense.

When you step into the defense room, everything will look familiar to those of you who have done your VCDX. Three panelists, moderator, whiteboard, and a projector. The moderator will give you the rules of the road, then you start your presentation. Panelists can interrupt at any time during your presentation to ask questions. Questions are not bad! In fact, they are asked to help improve your score and make sure you know your design. After the 90 minutes you get a 15 minute break.

Next up is the 30 minute troubleshooting scenario. You will be shown a few slides, then the timer will start. The panelists are looking for a methodical approach to solving the problem, not a scattershot process of asking random questions or throwing out guesses at the root cause. The goal is not to solve the problem, but to show how you would solve it. Curve balls can be thrown if you get close to the 'real' answer. At the end of 30 minutes you get a 5 minute break.

Finally, there is the 60 minute design scenario. Just like the VCDX, you are shown slides for a particular fictitious customer. The panelists then act as the customer, and you ask them questions about requirements, constraints, assumptions, and risks. You then start down the design path, answering questions as you go. And before you know it, the 60 minutes are up!

Now that you are totally mentally drained, the waiting game begins. Thankfully, you won't have to wait long. My results came in about 90 minutes after I was done. I was on the London Underground, which has quite spotty cell service. I got the results via Slack and email, but then cell coverage dropped for a few tube stops. So I couldn't tell anyone I had passed! LOL. I did shed a couple of tears of joy, and a couple of passengers were looking at me oddly.

Final Thoughts

Is the whole process worth it? Yes! Even if you don't successfully defend, the learning process alone makes you a better enterprise architect. Passing is just icing on the cake. Just like VCDX, the first attempt pass rate is fairly low, so don't be discouraged if you don't do it the first time around. Think of it as a chance to make yourself even better and really kick butt the next time!

I want to give a huge shout out to my NPX partner in crime, Bruno Sousa. We collaborated on the entire design, and split up the documentation work. His insight and knowledge were impeccable.

As a side note, pair/group submissions are allowed, but each contributor will defend individually.

I also want to thank the numerous people who supported me with mock sessions and document reviews, and who pushed me to keep my head down and succeed in becoming NPX #014.