Tuesday, December 30, 2014

OpenDedup Review

It's been quite some time, and 2014 has been quite eventful; I apologize for not writing anything useful for over a year. Incidentally, I ran this experiment back in August but didn't publish it, so here goes.

I've been pondering for a while how to back up my data in a safe place, and thought of renting my own server online and using rsync (Windows & Linux) to update its contents. To keep costs down, I looked at deduplication solutions and searched for what is currently available under Linux, since that's what I'd run on the rented server.

I found OpenDedup.org and it looked quite promising, though the project is only 2 years old. Their code trunk on Google Code has commits from July, their forums have active users, and I saw a patch mentioned there that was due to be pushed on Aug 24.

I created a VM with Debian 7.6 (x64) & installed the prerequisite, which is Java 7 (Yuck!). The OpenDedup debian package contains the binaries needed for mkfs and CLI tools. There are other packages for web based management, but I didn't try those.

I threw mkv files at the volume (via Samba/Windows share) and the dedup ratio was quite low (1%). Then I went to my work laptop and copied a directory that contains project info for all the companies I've worked with; the files are a mix of doc, docx, pdf, xls, xlsx, jpg, png & visio files. The end result: instead of consuming 3.5 GB, it took only 2.9 GB. I copied another directory of 852 MB, and it was reduced to 730 MB.
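
To put those numbers in perspective, here is a quick sketch that turns the before/after sizes into a savings percentage. The function name is my own; the sizes are the ones from the test above.

```python
def dedup_savings(logical_size, stored_size):
    """Return the space saved by deduplication as a percentage of the logical size."""
    if logical_size <= 0:
        raise ValueError("logical_size must be positive")
    return (logical_size - stored_size) / logical_size * 100


# Figures from the two test copies above (first in GB, second in MB).
print(f"Project documents: {dedup_savings(3.5, 2.9):.1f}% saved")
print(f"Second directory:  {dedup_savings(852, 730):.1f}% saved")
```

That works out to roughly 17% and 14% saved, before accounting for OpenDedup's own logs and hash tables.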

There is a penalty: OpenDedup creates logs, hash tables & data chunk files, and those keep growing. Your data set has to be large enough for that overhead to go unnoticed; if it's small, like my sample above, you won't see much gain overall.

Another issue with OpenDedup is that in my tests, it was wonky and sometimes the JVM crashed. It happened rarely throughout the day's test, but it did happen, and as it is Java based, it makes me queasy, so I approach it with caution.

The technology looks good, but it's not useful for multimedia files (videos & images) as those are already compressed. They tout a 95% deduplication ratio for VMware virtual machines, but that's mostly because the VMs run similar operating systems.

OpenDedup performs deduplication in the background on an interval basis, so it's not inline (live). That means one still needs to buy full disk capacity, then "hope" for savings. This is not specific to OpenDedup; the same is seen in enterprise storage systems, such as EMC's, whose white papers mention it along with the performance penalty, which is why deduplication is not recommended for production volumes sensitive to latency.

If you're still wondering whether I found a good way to back up, the answer is: SpiderOak (referral link). Alongside my OpenDedup tests, I tested SpiderOak, and it yielded the same deduplication ratio as OpenDedup (on their servers).

SpiderOak has a Zero Knowledge policy and design, which means that their systems can never see what data is being put there, at rest or during upload. Only devices with the client installed and with the correct password can access the data.

In addition to having a smart client that only uploads differences and not entire files, they perform deduplication on their end, so you don't have to pay for a lot of storage.

The last feature that made me love it and stick with it, is the fact that you pay for capacity regardless of the number of devices. You can have an unlimited number of devices using the purchased capacity, and you can traverse all the files for all backed up devices from the same user interface.

Saturday, August 17, 2013

Configuring FCoE on IBM Flex nodes and V7000 Storage

Disclaimer

I work for an IBM partner. The opinions depicted in this post are solely mine. Any performance degradation, bug, problem mentioned here is generic to any vendor, unless otherwise strictly specified.

Article Revisions

v1.0 - August 17 (17082013) - Initial release
v1.1 - August 17 (17082013) - Small additions to CN4093 section
v1.2 - October 11 (11102013) - Correction to FCoE frames and Ethernet frames (Thanks Anonymous!)

What is FCoE?

FCoE is short for Fiber Channel over Ethernet. It encapsulates FC frames inside Ethernet frames, allowing a server/node to communicate with a storage system through standard Ethernet, instead of investing in dedicated FC infrastructure.

The idea is to combine, or converge as the industry likes to call it, multiple protocols into a single technology, to reduce the datacenter clutter. With Ethernet, you can now transmit data packets (Ethernet), iSCSI (storage protocol) and FCoE (storage protocol).

Issues with FCoE

No Real Optimization

A payload is the data carried by the protocol from one point to another. The standard Ethernet frame payload is up to 1500 bytes, while FC frames can carry about 2kB (2112 bytes), so a full FC frame does not fit into a standard Ethernet frame. iSCSI, on the other hand, rides on TCP/IP and fits into standard Ethernet frames.
Ethernet equipment commonly supports an increased payload size of up to 9kB with something called Jumbo Frames. IPv6 allows a maximum payload size of 4GB (minus 1 byte) through jumbograms, but that requires modification of the Transport Layer to allow TCP/UDP to carry larger payloads, and it is handled at the IP layer rather than on the Ethernet frame.

FCoE runs directly on Ethernet frames, which max out at 9kB when Jumbo Frames are used. iSCSI, however, was designed for IP networks and runs on the Internet Protocol (IP) on top of Ethernet, which allows it to make use of the large payload size of IPv6 jumbograms.
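
The frame-size argument can be checked with a few lines of arithmetic. This is only a sketch: the 1500-byte and 9000-byte Ethernet payload limits and the 2112-byte FC data field are the commonly cited maximums, and the FCoE header/trailer sizes are my assumption of the usual encapsulation overhead, so validate them against your switch documentation.

```python
# Sizes in bytes. The FCoE overhead breakdown is an assumption for illustration.
STANDARD_ETHERNET_PAYLOAD = 1500
JUMBO_ETHERNET_PAYLOAD = 9000
FC_HEADER = 24
FC_MAX_DATA = 2112
FC_CRC = 4
FCOE_HEADER = 14   # version, reserved bits and start-of-frame marker
FCOE_TRAILER = 4   # end-of-frame marker plus padding


def fcoe_payload_size(fc_data_bytes=FC_MAX_DATA):
    """Bytes of Ethernet payload needed to carry one encapsulated FC frame."""
    if not 0 <= fc_data_bytes <= FC_MAX_DATA:
        raise ValueError("FC data field is limited to 2112 bytes")
    return FCOE_HEADER + FC_HEADER + fc_data_bytes + FC_CRC + FCOE_TRAILER


def fits(payload_limit, fc_data_bytes=FC_MAX_DATA):
    """True if the encapsulated frame fits in one Ethernet frame with that payload limit."""
    return fcoe_payload_size(fc_data_bytes) <= payload_limit


print("Full FC frame needs", fcoe_payload_size(), "bytes of Ethernet payload")
print("Fits standard Ethernet:", fits(STANDARD_ETHERNET_PAYLOAD))
print("Fits Jumbo Frames:", fits(JUMBO_ETHERNET_PAYLOAD))
```

Under these assumptions a full-size FC frame does not fit a standard frame, so the FCoE path needs larger-than-standard frames end to end, but it uses well under a quarter of a 9kB Jumbo Frame.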

So, if you're going with an overhead of protocols already, you might as well go with iSCSI on IPv6 (assuming the storage supports it) and enable Jumbo Frames (assuming the network backend supports very large Jumbo Frames), instead of going with FC over Ethernet!

Overhead and Replication

iSCSI has been in the business for a long time (6+ years) and works well with the standard Ethernet payload size, but better with jumbo frames. Many storage systems offer iSCSI and have been offering it for a long time. It also doesn't require any special protocols and it "just works", including across datacenters, as long as the link is stable (but there's latency, obviously).

FCoE is relatively new, and requires a heap of protocols to maintain a main "feature": Being Lossless. All this protocol overhead means extra latency, and from what I've been reading, it's not yet possible to push FCoE between datacenters. This means that whenever you need data replication across datacenters, you'll need to attach your storage box into an FC-only infrastructure, which means investing in an FC infrastructure! (OK, maybe just 2 switches, but they still cost money!)

Note: This IBM document mentions that it is possible to replicate between different V7000 storage systems via FCoE only, but the article above puts the limitation on the router/networking end, not on the storage end. Also, that post is 3 years old, so things might have changed now. Approach with caution, anyway, and validate with your vendors.

Another overhead is the encapsulation and decapsulation process. Disks speak the SCSI protocol. FC puts the SCSI commands and their data inside an FC frame and sends it over to the storage, which strips off the FC framing, then executes the SCSI commands.

With FCoE, the server is inserting a SCSI payload inside an FC payload and that is inserted into an Ethernet payload!
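
To make the nesting concrete, here is a toy model of the three layers. It's a sketch, not a wire-accurate implementation: the class and field names are mine, and real frames carry many more fields. The only real-world constant is the FCoE EtherType (0x8906).

```python
from dataclasses import dataclass

FCOE_ETHERTYPE = 0x8906  # EtherType registered for FCoE


@dataclass
class ScsiCommand:
    opcode: int
    data: bytes


@dataclass
class FcFrame:
    source_id: int
    destination_id: int
    payload: ScsiCommand


@dataclass
class EthernetFrame:
    source_mac: str
    destination_mac: str
    ethertype: int
    payload: FcFrame


def encapsulate(command, fc_src, fc_dst, mac_src, mac_dst):
    """Wrap a SCSI command in an FC frame, then wrap that in an Ethernet frame."""
    fc_frame = FcFrame(fc_src, fc_dst, command)
    return EthernetFrame(mac_src, mac_dst, FCOE_ETHERTYPE, fc_frame)


def decapsulate(frame):
    """Strip the Ethernet and FC layers to recover the SCSI command."""
    if frame.ethertype != FCOE_ETHERTYPE:
        raise ValueError("not an FCoE frame")
    return frame.payload.payload


# Hypothetical addresses, for illustration only.
read_command = ScsiCommand(opcode=0x28, data=b"")  # 0x28 is SCSI READ(10)
frame = encapsulate(read_command, 0x011101, 0x011100,
                    "0e:fc:00:01:11:01", "0e:fc:00:01:11:00")
print(decapsulate(frame) == read_command)
```

Every hop that terminates FC has to do the unwrap and re-wrap, which is the extra work I'm complaining about.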

Multi-Protocol Failure

Currently, FCoE is enabled on Converged Network Adapters (CNAs) that offer standard Ethernet functions (normal network access) + iSCSI + FCoE. When enabling FCoE, the CNA presents to the Operating System (OS) a bunch of storage adapters of type FC.

What happens when you have a failure on the adapter? You lose both network access and storage access. What happens if the FCoE switch fails? You lose both network access and storage access.

A related scenario would be storage upgrades, where one path needs to be offline to move the equipment from old stuff to new stuff; this affects both network and storage. One more scenario is your usual network administrator mistake, where the spanning tree configuration goes wrong, or someone adds a new VLAN or plugs a cable into a non-configured switch, and the network goes into a loop (think of it as a denial of service attack).
One could ask why you'd need storage access if the network has failed anyway, and that's a completely valid question. The issue is that the sudden loss of storage could also lead to data corruption.

That's why I personally prefer to keep the two activities separate: Network and Storage. It can still be done if you buy separate switches for FCoE and Ethernet, but then where is the "convergence" of your datacenter and its cost reductions?

Port Reservation

I don't know about other vendors, but on the IBM Flex chassis switches, using FCoE requires reserving 2 external ports from the switch (must be Omni Ports), even if you're using a V7000 Flex plugged into the chassis.

The FCoE protocol requires having an FC Forwarder (FCF) even if the traffic is internal to the chassis. These ports have to be reserved and configured in pairs. 2 ports are needed for every storage system to be connected via FCoE.

You do not need to plug SFPs into these reserved ports.

Limited Communications to Storage Systems

FCoE communicates through VLANs on the Ethernet network. Each NIC must belong to one VLAN when talking to an FCoE target. Because of that, a NIC can only talk to one storage system. If you need a node to talk to multiple storage systems, you'll need to assign each pair of NICs to a separate FCoE VLAN belonging to each FCoE storage system.

This limitation is not there for FC infrastructures, as a node's FC adapter registers itself on the FC SAN fabric, and then the administrator zones (groups) each adapter with a storage system. An adapter can be grouped with multiple storage systems, as long as all storage systems use the same FC adapter settings.

The FCoE connectivity limitation can be avoided by virtualizing various storage systems under one storage system, and exposing that one system to the nodes. IBM's SAN Volume Controller and its little brother the V7000 can do that.

Administration Role Separation

In large organizations, the roles of a network admin and a storage admin are separated. With Network Convergence, who will be responsible for configuring the network switches? Will one admin take responsibility for both network and storage?

Lab Setup

Alright, enough blabbing. Let's get to business. This is the lab setup for this experiment:

  1. IBM Enterprise Flex Chassis
  2. Two x240 nodes (Intel processors)
    1. Windows Server 2012 was preinstalled by a colleague so I used it for tests
    2. Installed ESXi 5.1 U1 (IBM Customized image) for Boot from SAN tests
  3. One 4-port CN4054 CNA on each node
    1. Firmware: 4.4.180.3
    2. Feature on Demand (FoD) to enable FCoE
  4. V7000 Flex storage (mounted into the chassis)
    1. Firmware: 6.4.1.3
  5. Two CN4093 converged switches
    1. Firmware: 7.5.3
    2. Base license, allowing use of only 2 ports on the 4-port cards

IBM's Flex chassis can hold nodes, Ethernet switches, FC switches, and storage, all in a 10U chassis, and the communication between the components is internal to the chassis at a minimum of 10Gbps. End of shameless plug.

Note0: The 4-port CNA is made by Emulex, and it has the same chipset found on the 2-port LAN on Motherboard (LoM) built into some x240 nodes.

Note1: The firmware levels above are important and you should meet these as a minimum. As of this writing, the storage has newer firmware, but I kept it at this level as it's the minimum required and for testing purposes.

Configuration Overview

  1. Configure x240 nodes and their CNAs
    1. Understanding the CNA
    2. Possible NIC Configurations
    3. Configure FCoE Feature on the NICs
      I won't cover OS configuration nor multipathing driver installation
    4. Configure nodes for SAN Boot via FCoE
  2. Configure V7000 Storage
  3. Configure the CN4093 Converged Switches
    1. Sample Configuration
    2. Configuration Explanation
  4. Profit!
If you need help upgrading the firmware of any component, refer to the device's user manual in the device links posted above. I won't cover these here.

Note: Throughout the guide, screenshots and configuration, I have masked the WWPNs and MACs of the devices used in the lab, because I'm paranoid. Deal with it.

1) Configuring x240 nodes and their CNAs

This is easy, but you could lose yourself within the forest of menus, so I have a few screenshots to make you happy. You can either follow the text description, or spoon-feed yourself with my awesome screenshots.

Understanding the CNA

I'll quickly explain how the CNA is going to function, so that you don't get confused when you configure it.

The 4-port 10Gbps CNA and the 2-port LoM have 4 physical ports and 2 physical ports respectively. When enabling Multichannel functionality, the CNA automagically splits each physical port into 4 virtual ports (vNICs).

So physical port 1 will have 4 vNICs: A1 = A1.1 + A1.2 + A1.3 + A1.4. Each vNIC can be allocated bandwidth, not exceeding 10Gb, and the total bandwidth of 10Gb is shared among all 4 vNICs, so you cannot over-commit the bandwidth. So, in an OS, you'll see 8 NICs if you have a 2-port LOM, and 16 NICs if you have a 4-port CNA (4 vNICs per physical port).
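
The vNIC arithmetic is easy to sanity-check before you touch the switch. This is a small sketch with function names of my own; it only encodes the two rules above (4 vNICs per physical port, and no over-committing the 10Gb pipe).

```python
VNICS_PER_PORT = 4


def visible_nics(physical_ports, multichannel=True):
    """Number of NICs the OS will see for an adapter with that many physical ports."""
    return physical_ports * VNICS_PER_PORT if multichannel else physical_ports


def check_allocation(percentages):
    """Validate the vNIC bandwidth percentages of one physical port and return their sum."""
    if len(percentages) > VNICS_PER_PORT:
        raise ValueError("a physical port has at most 4 vNICs")
    total = sum(percentages)
    if total > 100:
        raise ValueError(f"over-committed: {total}% of the physical port")
    return total


print(visible_nics(2))                       # 2-port LoM
print(visible_nics(4))                       # 4-port CNA
print(check_allocation([25, 25, 25, 25]))    # evenly split port
```

Something like check_allocation([50, 50, 25]) raises an error, which is the same over-commit the adapter won't let you configure.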

You can change the bandwidth allocation dynamically from the switch for any port, live. It's up to you how much bandwidth is allocated to the FCoE port. If you have a license to use all 4 ports, I suggest you use Ethernet on the 1st and 2nd NICs, and FCoE on the 3rd and 4th. This way, you'll be able to allocate full 10Gb to FCoE.

Use the NICs in sequence (1+2, 3+4) to make sure Ethernet passes through both CN4093 switches, and FCoE passes through both CN4093 switches. Ports 1 and 3 communicate with switch0 located in Bay1, while ports 2 and 4 communicate with switch1 located in Bay2.
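
If you script your planning, the port-to-switch wiring can be captured in a few lines. A minimal sketch, assuming the 1-based port numbering used in this section; the function names are mine.

```python
def switch_for_port(port_number):
    """Map a CNA physical port (1-4) to the chassis switch it is wired to."""
    if port_number not in (1, 2, 3, 4):
        raise ValueError("the CNA has ports 1 to 4")
    # Odd ports go to switch0 in Bay1, even ports to switch1 in Bay2.
    return ("switch0", "Bay1") if port_number % 2 == 1 else ("switch1", "Bay2")


def pair_is_redundant(port_a, port_b):
    """True if the two ports reach different switches."""
    return switch_for_port(port_a) != switch_for_port(port_b)


print(pair_is_redundant(1, 2))  # Ethernet pair
print(pair_is_redundant(3, 4))  # FCoE pair
print(pair_is_redundant(1, 3))  # both land on switch0
```

Pairing 1+3 or 2+4 would put both ports of a function on the same switch, which defeats the purpose.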

Possible NIC Configurations

  1. Use physical NICs
  2. Use virtual NICs
  3. Use a mix of pNICs and vNICs

Remember that each physical port of a 2-port LOM connects to 1 switch: port0 to switch0 and port1 to switch1. So, if you have 2 switches only, you have to use option (2): vNICs.

vNICs are mandatory if you want to share Ethernet and FCoE on the same pipe and you want to guarantee bandwidth for FCoE. If you do not use vNICs, FCoE and Ethernet will compete for bandwidth. If your servers are busy, it may lead to delayed I/Os and performance degradation.

My favorite configuration is if you have a 4-port adapter, and Upgrade1 switch licenses for your 2 switches, then you can use 2 ports as pNICs for FCoE and 2 ports as pNICs for Ethernet. No need for vNIC configuration.

Alternatively, you can also enable vNICs on the first 2 ports, and leave the 3rd and 4th ports as pNICs. Or the opposite. So you can mix, but they'll have to be in pairs.

If you have a 4-port adapter, with the base license of the switches, your options are the same as the LOM, in the first paragraph.

Configure FCoE Feature on the NICs

  1. Power on the node and press F1 to login to the UEFI setup
  2. UEFI main menu -> System Settings -> Network -> Select 1st NIC (PFA 17:0:0 here) -> Emulex 10G NIC
    You're now at the Emulex NIC Selection menu
  3. Notice the link speed. It should report a number.
  4. Switch Configuration: Change it to IBM Virtual Fabric -- default: Switch Independent
  5. Personality: Change it to FCoE -- default: NIC
  6. Multichannel: Enable if you want to enable vNICs
  7. Controller Configuration -> View Configuration
  8. The 2nd NIC should report itself as FCoE. Only 1 NIC will have FCoE functions.
    Notice that the numbering of the NICs is all even. These NICs belong to switch0 located in Bay1.
  9. Press Esc until you're back at the Emulex NIC Selection menu.
  10. Feature on Demand -> Install FCoE license
  11. You're now done with the first NIC. The 2nd NIC will have the same settings as the 1st. You will need to repeat the steps above for the 3rd NIC, and that NIC's settings will be applied to the 4th.
  12. Press Esc until you're back at the Network menu and select the 2nd NIC.
    Notice that the NICs have odd numbers. These are mapped to switch1 located in Bay2.
  13. Esc to the System Settings menu -> Emulex Configuration Utility
    If you do not see this option, Esc to UEFI main menu, save, then exit to reboot the node.
  14. Highlight the 1st NIC (001) but don't click on it. Write down the NIC's Port Name and node name in a text file for later use. Highlight the other NICs and write their PNs.
    If you don't have an Upgrade1 license for your CN4093 switches, you won't be able to use the 3rd and 4th NICs, so you can ignore them.
  15. Click on the 1st NIC. You're now at the Emulex Adapter Configuration menu.
  16. Configure DCBX Mode: Change it to CEE -- default: CIN
  17. Later on when you're done configuring the storage and the switch, come back here and run Scan for Fiber Devices and you should see the V7000 listed (ID 2145)
    Also, scroll down and click on Display Adapter Info sub-menu and you'll see the FCoE VLAN ID, if your switch was configured properly. This is auto-discovered.
  18. Esc to the Emulex Adapter Configuration menu, and select the 2nd NIC (002) then repeat the same steps.
  19. Esc to UEFI main menu, save and reboot. Come back to Scan for Fiber Devices later, once the storage and switches are configured.

With those easy steps, you have completed ONE node. Repeat the same for all nodes. If you're fortunate enough to have ordered the Flex System Manager node, then it's your lucky day! You can create a Configuration Template of the configured node, which captures its hardware component configurations, and deploy that hardware configuration to other nodes. It's magic.

If you intend to use the Configuration Templates, I recommend that you configure all the components (internal disk RAID setup, time, boot order, ...etc.), then create the template out of the node.

UEFI -> System Settings menu

Network -> Select NIC

Click on that to get the juicy settings

Change settings as listed. Multichan is for vNICs

Showing the available options

Showing the available options

Click it!

FCoE vNIC is always the 2nd

FCoE requires a license. Install it.

To the next adapter

The 2nd NIC follows the settings of the 1st


Emulex Configuration Utility for FCoE HBA Settings

Select 1st NIC. Note Port Name for FC zoning

Change settings to CEE

After storage and switch config is done, scan fiber devices

Configure nodes for SAN boot via FCoE

Each canister/controller will have 1 port looking at one switch and the other port looking at the other switch, which means on each switch you'll see both controllers.

This guide is specific to V7000 and V7000 Flex and VMware ESXi 5.1 (IBM Customized Image). For other storage types, I highly recommend you read and follow the steps in the "Storage and Network Convergence Using FCoE and iSCSI" redbook (link in references). It explains booting from SAN with FCoE and iSCSI, and has excellent tips.

  1. Configure the FCoE switches and make sure that the storage and nodes are functioning properly.
  2. Create a volume and assign it to the node that will boot from SAN. It must be the first volume assigned to the node (LUN 0/SCSI ID 0).
  3. Boot the node into UEFI -> System Settings menu -> Emulex Configuration Utility
  4. Select the 1st adapter
  5. Set Boot from SAN: Change it to Enable
  6. Validate storage connectivity and volume/LUN assignment: Navigate to Add Boot Device -> Select 1st Controller
    If you don't see the storage or LUN 0000, then you need to finish configuring the switches, assign a LUN to the node, then come back here.
    Do not select a boot device here. This is only for validation of connectivity.
  7. Configure HBA and Boot Parameters -> Boot Target Scan Method: Select Boot Path Discovered Targets
    Commit Changes.
  8. Esc to Adapter Selection menu and select the 2nd adapter, and repeat the same steps.
  9. Configuring FCoE SAN boot should be sufficient on 2 ports.
  10. Esc to System Settings menu -> Devices and I/O Ports -> Enable/Disable Onboard Devices
  11. SAS Controller: Disable to disable booting from local disks on the node. Do this even if you don't have any local disks.
  12. Esc to Devices and I/O Ports -> Device Boot Priority
  13. Drag the SAS Controller to the bottom of the list. Save/Commit.
  14. Esc to Main Menu -> Boot Manager -> Add Boot Option -> Generic Boot Option
  15. Add Hard Disk 0, 1, 2, 3
    If you configure 2 FCoE ports, you'll have 4 possible paths to boot from. By selecting 4 disks, the UEFI will configure each path into a Hard Disk, and boot from the first available one.
  16. Esc to Boot Manager -> Delete Boot Option: Delete anything that you don't need (PXE, Floppy)
  17. Esc to main menu -> Save
  18. Reboot and install OS
Note: During adapter preparation phase in UEFI, it'll probe the FCoE ports and see which one is online, and will nominate and use one of them only.

Steps 7 and 15 allow high flexibility and reduce configuration time for implementations that have many nodes. The typical method of implementation is defining the boot LUN and path for each node. So if you have 10 nodes with 2 FCoE ports each, and each port sees both storage controllers, you'd need to repeat those configurations 10 x 2 x 2 = 40 times! Using Boot Discovery and auto Hard Disk assignment by UEFI, you avoid this headache.

It does add some extra time to the boot process, but that's a small price to pay for the flexibility.

Emulex Configuration Utility

Adapter Selection

Enable Boot from SAN for both adapters

You should be able to see storage and LUNs here

Do not add the LUNs. Just validate connectivity.

Configure HBA and Boot Parameters

Boot Path Discovered Targets

Add Boot Option -> Generic Boot Option

Add Hard Disk 0, 1, 2 and 3 for a total of 4 paths

Devices and I/O Ports

Enable / Disable Onboard Devices

Disable SAS Controller


2) Configure V7000 Storage

If you have purchased the V7000/V7000 Flex with the FCoE daughter cards, there's no configuration for FCoE. If you bought the daughter cards at a later stage, you'll need to activate them from the canisters. This won't be covered here. Please refer to the online manual.

If you log in to the V7000's web interface, you'll see each canister's (controller) Port Numbers, for both FC and Ethernet. These numbers only show up once the switch is configured.

V7000 Flex canister/controller 1

V7000 Flex canister/controller 2

Note that the port type is FC



3) Configure the CN4093 Switches

As mentioned before, a minimum of 2 external Omni ports must be reserved, even if you're using a V7000 Flex inside the same chassis as the nodes.

This switch configuration will assume default bandwidth allocations. I highly advise you to read the CN4093 redbook (link in references) for optimizations.

I'll first write the entire switch config, then explain each section.

Log in to the switch in "iscli" mode, then type "enable" to access the enable mode. Now type "config terminal" to be able to modify the configuration.

version "7.5.3"
switch-type "IBM Flex System Fabric CN4093 10Gb Converged Scalable Switch"
!
system port EXT15-EXT16 type fc
!
interface port INTA1
name "Flex System Manager node"
no flowcontrol
exit
!
interface port INTA2
name "Power p260 node"
no flowcontrol
exit
!
interface port INTA3
name "x240 node1"
tagging
no flowcontrol
exit
!
interface port INTA4
name "x240 node2"
tagging
no flowcontrol
exit
!
interface port INTA5
tagging
no flowcontrol
exit
!
interface port INTA6
tagging
no flowcontrol
exit
!
interface port INTA7
name "v7000 flex node1"
tagging
pvid 1002
no flowcontrol
exit
!
interface port INTA8
name "v7000 flex node2"
tagging
pvid 1002
no flowcontrol
exit
!
interface port INTA9
tagging
no flowcontrol
exit
!
interface port INTA10
tagging
no flowcontrol
exit
!
interface port INTA11
tagging
no flowcontrol
exit
!
interface port INTA12
tagging
no flowcontrol
exit
!
interface port INTA13
tagging
no flowcontrol
exit
!
interface port INTA14
tagging
no flowcontrol
exit
!
vlan 1
member INTA1-INTA6,INTA9-INTA14,EXT1-EXT2,EXT11-EXT16
no member INTA7-INTA8
!
vlan 1002
enable
name "fcoe"
member INTA3-INTA4,INTA7-INTA8,EXT15-EXT16
fcf enable
!
!
vnic enable
vnic port INTA3 index 1
bandwidth 25
enable
exit
!
vnic port INTA4 index 1
bandwidth 25
enable
exit
!
vnic vnicgroup 1
vlan 3001
enable
member INTA3.1
member INTA4.1
exit
!
spanning-tree stp 80 vlan 3001
!
spanning-tree stp 113 vlan 1002
!
!
!
!
fcoe fips enable
!
fcoe fips port INTA3 fcf-mode off
fcoe fips port INTA4 fcf-mode off
fcoe fips port INTA7 fcf-mode on
fcoe fips port INTA8 fcf-mode on
fcoe fips port EXT15 fcf-mode on
fcoe fips port EXT16 fcf-mode on
!
!
cee enable
!
!
fcalias v7k_node1_p1 wwn 50:00:00:00:00:04:00:76
fcalias v7k_node2_p1 wwn 50:00:00:00:00:04:00:77
fcalias node3 wwn 10:00:00:00:00:00:00:5d
fcalias node4 wwn 10:00:00:00:00:00:00:6b
!
zone name v7k_node3
        member fcalias v7k_node1_p1
        member fcalias v7k_node2_p1
        member fcalias node3
zone name v7k_cluster
        member fcalias v7k_node1_p1
        member fcalias v7k_node2_p1
zone name v7k_node4
        member fcalias node4
        member fcalias v7k_node2_p1
        member fcalias v7k_node1_p1
zoneset name ActiveConfig
member v7k_node3
member v7k_cluster
member v7k_node4
zoneset activate name ActiveConfig
!
no ip routing
!
!
end
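
Most of the interface stanzas above are repetitive, so if you have several chassis to do, you may prefer to generate them. Here's a minimal sketch that prints stanzas in the same shape as the config above. The function name and parameters are my own invention; it only emits the lines used in this post and knows nothing else about the switch.

```python
def interface_stanza(port, name=None, tagging=True, pvid=None):
    """Build one 'interface port' stanza in the layout used in the config above."""
    lines = [f"interface port {port}"]
    if name:
        lines.append(f'name "{name}"')
    if tagging:
        lines.append("tagging")
    if pvid is not None:
        lines.append(f"pvid {pvid}")
    lines.append("no flowcontrol")
    lines.append("exit")
    lines.append("!")
    return "\n".join(lines)


# The two V7000 Flex ports, then the unnamed tagged ports INTA9 to INTA14.
print(interface_stanza("INTA7", "v7000 flex node1", pvid=1002))
print(interface_stanza("INTA8", "v7000 flex node2", pvid=1002))
for number in range(9, 15):
    print(interface_stanza(f"INTA{number}"))
```

Review the output before pasting it into the switch; it's a text generator, not a validator.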

Configuration Explanation

system port EXT15-EXT16 type fc
This changes the type of the Omni ports from being Ethernet ports to FC ports. This is required to bind the ports to a storage system, whether the storage is internal to the chassis or external. If your storage is external, these are the ports where you have to plug the FC SFPs and cables to your external SAN fabric.

interface port INTA1-INTA14
name "port name"
no flowcontrol
tagging
pvid 1002
interface port : defines which ports you want to work on. You can specify 1 port or a range. If you have Upgrade1 license, you can also define INTA1-INTB14 to modify all 28 ports in one shot.

name "port name" : It's better that you do this on a per port basis, to give each port a unique name, to know which system is using that port.

no flowcontrol : Disables traffic flowcontrol. A requirement for FCoE.

tagging : Enable VLAN tagging on a port, allowing that port to belong to multiple VLANs. Do not enable this on ports that will not use FCoE, nor require VLAN tagging. An example of this is a standalone Windows/Linux node.

pvid 1002 : Set the Port VLAN ID (native VLAN) on the port. The default is 1. This has to be changed to the VLAN of the FCoE on the V7000 Flex ports. If you do not have a chassis storage, no internal port needs this PVID set.

vlan 1
member INTA1-INTA6,INTA9-INTA14,EXT1-EXT2,EXT11-EXT16
no member INTA7-INTA8
!
vlan 1002
enable
name "fcoe"
member INTA3-INTA4,INTA7-INTA8,EXT15-EXT16
fcf enable
!
These are VLAN definitions, and which ports belong to the VLAN and which don't.
1002 is the preferred VLAN ID for FCoE. You can change this to whatever you want, but make sure the customer network doesn't have the same ID on the Ethernet network to not cause confusion for your nodes.

fcf enable : Enable the Fiber Channel Forwarder (FCF) function on this VLAN. This is a must on the FCoE VLANs if you have a V7000 Flex or an upstream (Top of Rack) switch that understands FCoE. If you're connecting the chassis to a SAN fabric, you need to enable NPV mode. See the CN4093 redbook for details.

vnic enable
vnic port INTA3 index 1
bandwidth 25
enable
exit
vnic enable : This is only needed if you need vNICs and want to enable it.

vnic port index 1 : This is vNIC1 of the internal physical port 3. In other words, it's INTA3.1.
You only need to set this, if you want to use this specific vNIC. If you do not set these settings, it'll appear as disconnected on the OS.

bandwidth 25 : Allocate 25% of the 10Gb bandwidth, which is 2.5 Gbps to this vNIC.

Note: You do not allocate bandwidth nor define a vNIC index for the FCoE port.

vnic vnicgroup 1
vlan 3001
enable
member INTA3.1
member INTA4.1
exit
vnic vnicgroup : Create a vNIC Group to add members to it. This is a must for vNIC configurations. Not required for non-vNIC setup.
The group members can be vNICs, internal physical ports, and external ports. In the example above, only internal ports were added. No external ports were configured.

vlan 3001 : Each vNIC Group requires its own VLAN, and this must not be an existing VLAN. This is only for internal communication, and will not conflict with the customer side VLANs.

vNICs not added to a vNIC Group, will appear as disconnected.

spanning-tree stp 80 vlan 3001
If spanning tree is enabled, this will place the VLAN 3001 in its own Spanning Tree Group number 80. The firmware will by default assign each VLAN into its own STG without having to do this manually.

fcoe fips enable
!
fcoe fips port INTA3 fcf-mode off
fcoe fips port INTA4 fcf-mode off
fcoe fips port INTA7 fcf-mode on
fcoe fips port INTA8 fcf-mode on
fcoe fips port EXT15 fcf-mode on
fcoe fips port EXT16 fcf-mode on
!
cee enable

fcoe fips enable : Enable FCoE Initialization Protocol (FIP) snooping, which will detect which ports support FCoE and which don't.

fcf-mode off/on/auto : It should be off for the internal ports of the compute nodes, and on for the storage and FC ports. You can also avoid mistakes and set this to auto on all ports.

cee enable : Enable Converged Enhanced Ethernet to allow FC packet encapsulation over Ethernet.

fcalias
Define an alias to make it easy to identify nodes and storage ports.

no fcalias wwn
To remove an already configured fcalias.

zone name
Create a zone and add aliases to this zone.

zoneset name
zoneset activate name
Create a zoneset, which is a group of zones to enable this set for the entire switch.
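
Since the zoning lines differ between the two switches only by WWPN, generating them saves typos. This is a sketch that emits fcalias, zone and zoneset lines in the same layout as my config; the function name is mine, the "v7k_" zone prefix just mirrors my naming, and it leaves out the storage-only v7k_cluster zone.

```python
def zoning_commands(storage_ports, hosts, zoneset="ActiveConfig"):
    """Build fcalias, zone and zoneset lines in the style of the configuration above.

    storage_ports and hosts map alias names to WWPNs.
    """
    lines = []
    for alias, wwpn in {**storage_ports, **hosts}.items():
        lines.append(f"fcalias {alias} wwn {wwpn}")
    zone_names = []
    for host in hosts:
        zone = f"v7k_{host}"
        zone_names.append(zone)
        lines.append(f"zone name {zone}")
        # One zone per host: both storage ports plus that host.
        for alias in [*storage_ports, host]:
            lines.append(f"        member fcalias {alias}")
    lines.append(f"zoneset name {zoneset}")
    lines.extend(f"member {zone}" for zone in zone_names)
    lines.append(f"zoneset activate name {zoneset}")
    return lines


# Masked WWPNs, same as in the config listing.
storage = {
    "v7k_node1_p1": "50:00:00:00:00:04:00:76",
    "v7k_node2_p1": "50:00:00:00:00:04:00:77",
}
hosts = {
    "node3": "10:00:00:00:00:00:00:5d",
    "node4": "10:00:00:00:00:00:00:6b",
}
commands = zoning_commands(storage, hosts)
print("\n".join(commands))
```

Keeping one zone per host means a misbehaving node can't see the other nodes' adapters.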

no ip routing
Disable Layer3 routing, and make the switch a Layer2 switch only.

show fcoe database
-----------------------------------------------------------------------
 VLAN  FCID                  WWN                     MAC         Port
-----------------------------------------------------------------------
 1002  011000     50:00:00:00:00:04:00:77      0e:fc:00:01:10:00   INTA8
 1002  011100     50:00:00:00:00:04:00:76      0e:fc:00:01:11:00   INTA7
 1002  011101     10:00:00:00:00:00:00:5d      0e:fc:00:01:11:01   INTA3

 Total number of entries = 3

-----------------------------------------------------------------------
Displays the currently established FCoE connections on the switch. It doesn't show any node-storage associations. It shows the nodes/storage that have been detected to have FCoE. The table above is sample output.
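
If you capture that output to a file, a few lines of scripting will tell you which WWPNs have not logged in yet. A rough sketch: the regular expression is written against the sample output above only, so treat the column layout as an assumption for other firmware levels.

```python
import re

# One table row: VLAN, 6-digit FCID, 8-byte WWN, 6-byte MAC, port name.
ENTRY = re.compile(
    r"^\s*(?P<vlan>\d+)\s+(?P<fcid>[0-9a-fA-F]{6})\s+"
    r"(?P<wwn>(?:[0-9a-fA-F]{2}:){7}[0-9a-fA-F]{2})\s+"
    r"(?P<mac>(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2})\s+(?P<port>\S+)\s*$"
)

SAMPLE = """\
 VLAN  FCID                  WWN                     MAC         Port
 1002  011000     50:00:00:00:00:04:00:77      0e:fc:00:01:10:00   INTA8
 1002  011100     50:00:00:00:00:04:00:76      0e:fc:00:01:11:00   INTA7
 1002  011101     10:00:00:00:00:00:00:5d      0e:fc:00:01:11:01   INTA3
"""


def parse_fcoe_database(output):
    """Return one dict per table row; header and separator lines are skipped."""
    entries = []
    for line in output.splitlines():
        match = ENTRY.match(line)
        if match:
            entries.append(match.groupdict())
    return entries


def missing_logins(entries, expected_wwns):
    """List the expected WWPNs that do not appear in the parsed table."""
    seen = {entry["wwn"].lower() for entry in entries}
    return sorted(wwn for wwn in expected_wwns if wwn.lower() not in seen)


entries = parse_fcoe_database(SAMPLE)
print(missing_logins(entries, ["10:00:00:00:00:00:00:5d", "10:00:00:00:00:00:00:6b"]))
```

In the sample, only one of the two compute nodes has logged in, so the second node's WWPN comes back as missing.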

show zone
List the configured zones on the switch.

For details and explanations of each command, or extra details, do read the CN4093 redbook (linked below in the references).

Note: The above configuration should be the same for the 2nd CN4093 switch, except for the FCalias parts as the WWPNs will be different.

References

  1. IBM V7000 Storage
    1. IBM Storwize V7000 Information Center
    2. Configuration Limits and Restrictions for IBM Storwize V7000
    3. Implementing the IBM Storwize V7000 V6.3
    4. IBM Flex System V7000 Storage Node Introduction and Implementation Guide
  2. Internet Small Computer Systems Interface (iSCSI)
    1. iSCSI Standard by IETF
    2. Comparing Performance Between iSCSI, FCoE and FC
  3. FCoE
    1. Storage and Network Convergence Using FCoE and iSCSI (redbook)
    2. FCoE Between Datacenters
    3. Fixing Stupid, an FCoE Response
    4. FCoE: Additional Considerations (T11 Fiber Channel Committee)
    5. FCoE Questions and Answers (Cisco)
    6. Datacenter Bridging Exchange (DCBX)
  4. Fiber Channel
    1. Fiber Channel Generations (16 Gbps FC)
    2. FC vs iSCSI (Trusted Network Solutions)
    3. FC Frames
  5. IBM CN4093 and EN4093R
    1. Application Guide for EN4093 and EN4093R - Second Edition
    2. Application Guide for CN4093 - First Edition
    3. IBM Networking OS 7.5 Release Notes for CN4093
  6. Emulex
    1. Emulex Universal Multichannel Reference Guide (Guide for the CN4054 VFA)
    2. White papers and documents for cards by Emulex made for IBM
    3. More white papers
    4. Emulex Virtual Fabric Adapter drivers, firmware and user guide
  7. Network Frames
    1. IPv6 Packets
    2. FCoE Frames
    3. Jumbo Frames
    4. Ethernet Frames
    5. Internet Protocol (IP)

Thursday, August 8, 2013

Linux NIC Bonding and VLAN Tagging with IBM Flex Chassis

What is IBM Flex Chassis?

The IBM Flex chassis is IBM's new blade platform, which replaces the 10-year-old BladeCenter H chassis. Like the BladeCenter chassis, the Flex can fit fully functional network switches into the chassis (unlike Cisco, which uses dummy pass-thru modules that plug into top-of-rack switches).

Environment Setup

My customer had 1 Flex chassis in the main site and another in the disaster recovery (DR) site. Each chassis had IBM 10Gb EN4093 Scalable Switches. The 2 switches in each chassis were interconnected via Virtual Link Aggregation (VLAG) to load balance the traffic between each other. They were connected to 1 Cisco ToR switch. The spanning tree protocol (STP) was PVRST+.

Some of the nodes/servers in the chassis were running VMware & some were running Red Hat Enterprise Linux (RHEL) 6 with Oracle RAC set up on top of that.

The Oracle nodes needed to have multiple IPs belonging to multiple VLANs. The nodes had only 2 internal 10Gb NICs, so NIC Bonding with VLAN tagging was the best choice. I used Linux's native NIC Bonding.

As of this writing, Emulex does not offer NIC Bonding software for their chips on the IBM Flex nodes.

The Problem

The RHEL nodes were configured with active-passive NIC teaming, but they were losing connectivity randomly, and Oracle RAC would report that one of the configured interfaces could no longer communicate and that the cluster was affected.

The Solution

The chassis switches act as 2 switches: 1 switch to the nodes and 1 switch to the outside world. Because of this, even if the switch loses connectivity to the outside world, the internal nodes wouldn't know about the uplink failure. Also, for some reason, the MACs weren't being updated on the Cisco ToR switch.

So, instead of using "miimon", which only monitors the physical link between the node and the internal ports of the switch, I changed it to ARP monitoring (the "arp_interval" and "arp_ip_target" options). The bond sends ARP requests through to the ToR L3 switches, which keeps the MAC table refreshed and prevents the IPs from flapping on the nodes.

Configuration

The following configuration was done on RHEL 6. It should work similarly on all distributions, but the location of the files may differ.

Enabling NIC Bonding and Setting Parameters

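Append these lines to the file /etc/modprobe.conf (RHEL 6 deprecates this file in favor of files under /etc/modprobe.d/, for example /etc/modprobe.d/bonding.conf):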
# bonding config
alias bond0 bonding
options bond0 mode=active-backup arp_interval=50 arp_ip_target=10.10.5.1,10.10.1.1

The arp_interval value is in milliseconds. You can specify multiple target IPs, up to a maximum of 16. I suggest adding 2. Here, the target IPs are the gateway IPs of the VLANs.
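
To make the constraints above concrete, here is a minimal Python sketch that builds the "options" line and rejects obviously bad input. The function name bonding_options is my own invention for illustration; it is not part of the bonding driver or of any RHEL tool, and it only checks the rules mentioned in this post (positive interval, 1 to 16 IPv4 targets).

```python
import ipaddress

MAX_ARP_TARGETS = 16  # limit stated in the text above


def bonding_options(arp_interval_ms, targets, mode="active-backup"):
    """Return an 'options bond0 ...' line after basic sanity checks.

    Hypothetical helper for illustration only.
    """
    if arp_interval_ms <= 0:
        raise ValueError("arp_interval must be a positive number of milliseconds")
    if not 1 <= len(targets) <= MAX_ARP_TARGETS:
        raise ValueError("need between 1 and 16 ARP targets")
    for target in targets:
        # Raises a ValueError subclass if the address is malformed.
        ipaddress.IPv4Address(target)
    return "options bond0 mode={} arp_interval={} arp_ip_target={}".format(
        mode, arp_interval_ms, ",".join(targets))


# Same values as the configuration shown above.
line = bonding_options(50, ["10.10.5.1", "10.10.1.1"])
print(line)
```

The output is the same "options" line as the one shown earlier, so the sketch is only useful as a guard when you generate the file for many nodes.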

Configuring Interfaces with VLAN Tagging

cd to /etc/sysconfig/network-scripts and create the following files:

File name: ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes

File name: ifcfg-bond0.105
DEVICE=bond0.105
BOOTPROTO=static
IPADDR=10.10.5.112
NETMASK=255.255.255.0
GATEWAY=10.10.5.1
ONBOOT=yes
VLAN=yes

File name: ifcfg-bond0.101
DEVICE=bond0.101
BOOTPROTO=static
IPADDR=10.10.1.112
NETMASK=255.255.255.0
GATEWAY=10.10.1.1
ONBOOT=yes
VLAN=yes

The file name has to end with bond0.VLANID, and the device name has to match it. The IP address scheme can be whatever the network team defined for that VLAN.
The network engineers I worked with create VLAN IDs & IP schemes like this:
VLAN 105 -> 10.10.5.x
VLAN 1055 -> 10.10.55.x

You don't have to follow the same convention, but it makes it easy to tell the VLAN ID from the IP.
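
As a rough illustration of that convention, the sketch below derives the VLAN ID from an address. It assumes the VLAN ID is always the literal prefix "10" followed by the third octet, which is my reading of the two examples above; the function name vlan_from_ip is hypothetical.

```python
def vlan_from_ip(ip):
    """Derive the VLAN ID from an address under the 10.10.N.x convention.

    Assumption: VLAN ID = "10" followed by the third octet (105 -> 10.10.5.x).
    """
    octets = ip.split(".")
    if len(octets) != 4 or octets[:2] != ["10", "10"]:
        raise ValueError("address does not follow the 10.10.N.x convention")
    vlan = int("10" + str(int(octets[2])))
    if not 1 <= vlan <= 4094:  # usable 802.1Q VLAN ID range
        raise ValueError("derived VLAN ID {} is out of range".format(vlan))
    return vlan


for address in ("10.10.5.112", "10.10.1.112", "10.10.55.7"):
    print(address, "->", "VLAN", vlan_from_ip(address))
```

Note that the convention only works while the third octet stays at 99 or below, because anything larger would produce a VLAN ID above 4094.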

You can repeat the above steps and create as many files as you have VLANs.
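
If you have many VLANs, writing the files by hand gets tedious. Here is a small Python sketch that renders the same file contents as above from a list of VLANs. The template mirrors the files in this post, while the function name render_vlan_ifcfg and the fixed /24 netmask are assumptions of mine. It only prints the result; it does not write anything under /etc.

```python
VLAN_TEMPLATE = """DEVICE=bond0.{vlan}
BOOTPROTO=static
IPADDR={ip}
NETMASK=255.255.255.0
GATEWAY={gateway}
ONBOOT=yes
VLAN=yes
"""


def render_vlan_ifcfg(vlan, ip, gateway):
    """Return (file name, file contents) for one tagged bond interface.

    Hypothetical helper; assumes a /24 network as in the examples above.
    """
    name = "ifcfg-bond0.{}".format(vlan)
    return name, VLAN_TEMPLATE.format(vlan=vlan, ip=ip, gateway=gateway)


# Same VLANs and addresses as the two files shown above.
vlans = [
    (105, "10.10.5.112", "10.10.5.1"),
    (101, "10.10.1.112", "10.10.1.1"),
]

for vlan, ip, gateway in vlans:
    name, body = render_vlan_ifcfg(vlan, ip, gateway)
    print("File name:", name)
    print(body)
```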

Modify the following files:

File name: ifcfg-eth0
DEVICE=eth0
BOOTPROTO=none
MASTER=bond0
SLAVE=yes
ONBOOT=yes

File name: ifcfg-eth1
DEVICE=eth1
BOOTPROTO=none
MASTER=bond0
SLAVE=yes
ONBOOT=yes

Repeat these steps for each NIC that you want to participate in the NIC Bonding group.

Configuring The Chassis Switches

The only thing missing now is creating the VLANs on the switches, then enabling VLAN Tagging on the nodes' NICs (internal ports) on the chassis switches. I'll be using the "iscli" command line interface instead of "ibmcli."

It's better to have firmware 7.5.3 or later on the switches before proceeding. Otherwise some commands may differ and some features may be missing (like Auto Spanning Tree Group assignment), which means extra manual work.

Enabling VLAN Tagging on the internal ports:
interface port INTA1-INTA14
tagging

Create VLANs:
vlan 101
enable
member INTA1-INTA12,INTA13,INTA14

This creates VLAN ID 101 and places internal ports 1-14, which belong to nodes 1-14, in it. I wrote it this way to show how you can define non-consecutive ports.

vlan 105
enable
member INTA1-INTA14
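
The two "member" lines above select the same 14 ports. The Python sketch below expands both forms to show that. It is my own toy parser for this port-list notation (the name expand_ports is hypothetical), not the switch's actual parsing logic.

```python
import re

PORT = re.compile(r"([A-Z]+)(\d+)")


def expand_ports(spec):
    """Expand a list such as 'INTA1-INTA12,INTA13,INTA14' into port names.

    Toy parser for illustration; not the switch's own implementation.
    """
    ports = []
    for item in spec.split(","):
        first, _, last = item.strip().partition("-")
        start = PORT.fullmatch(first)
        end = PORT.fullmatch(last or first)  # a single port is a range of one
        if not start or not end or start.group(1) != end.group(1):
            raise ValueError("unsupported port spec: " + item)
        prefix = start.group(1)
        low, high = int(start.group(2)), int(end.group(2))
        ports.extend("{}{}".format(prefix, n) for n in range(low, high + 1))
    return ports


mixed = expand_ports("INTA1-INTA12,INTA13,INTA14")
plain = expand_ports("INTA1-INTA14")
print(mixed == plain, len(mixed))
```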

The default port VLAN ID (PVID) is 1. This is the native VLAN. Any untagged traffic will be siphoned there. If the customer's native VLAN ID is different, change this value. If the customer does not intend to have any untagged traffic, it's better to change this value to something that doesn't exist on the customer side, to create a black hole on the internal switches for unwanted untagged traffic.

To change the PVID:
interface port INTA1-INTA14
pvid 5

This assumes the native VLAN at the customer side is 5. To set it to something that doesn't exist, agree with the customer on a VLAN that they'll never use. In my case, I often use 3999.

interface port INTA1-INTA14
pvid 3999

You don't have to create the VLAN beforehand. The switch will automatically create the VLAN, assign it to its own Spanning Tree Group (STG) and change the PVID of the defined node ports.

That's it! Now restart the network services and the bonding interfaces should come up.
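
To check the result, the bonding driver exposes its state in /proc/net/bonding/bond0 as "Key: Value" lines. Below is a minimal Python sketch that pulls out the bond-wide values. The SAMPLE text is hand-written and abbreviated for illustration, not captured from a real node, and the function names are hypothetical.

```python
# Hand-written, abbreviated sample of the status file (assumption, not real output).
SAMPLE = """Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth0
MII Status: up

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: down
"""


def bond_summary(text):
    """Collect 'Key: Value' pairs from the bond-wide section only."""
    summary = {}
    for line in text.splitlines():
        if line.startswith("Slave Interface:"):
            break  # per-slave sections follow; stop at the first one
        key, sep, value = line.partition(":")
        if sep:
            summary[key.strip()] = value.strip()
    return summary


def read_bond_status(path="/proc/net/bonding/bond0"):
    """Read and summarize the status file on a node that has bond0."""
    with open(path) as handle:
        return bond_summary(handle.read())


status = bond_summary(SAMPLE)
print("Active slave:", status.get("Currently Active Slave"))
```

On a real node you would call read_bond_status() instead of parsing SAMPLE, then pull a cable or disable a switch port and confirm that the active slave changes.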

References

  1. RHEL: Linux Bond / Team Multiple Network Interfaces (NIC) Into a Single Interface
  2. Linux Ethernet Bonding Driver HOWTO
  3. NIC Bonding for KVM (has cute graphs)

I highly recommend that you read the 2nd link (the kernel guide) before doing anything. It explains the different bonding modes (active/passive, active/active, EtherChannel, etc.) and whether they require ToR switch support or not.

Caution

Remember that you cannot use active/active in an EtherChannel/PortChannel manner because the 2 internal NICs in each node belong to two different switches, and EtherChannel requires that the ports belong to the same switch. It is possible if you stack the two chassis switches, but I have not attempted this before.

Also, make sure the STP mode used on the IBM switches matches whatever is used on the customer side, otherwise you'll cause a network loop and bring down the entire customer network!

May your packets serve you well.