Channel: VMware Communities Syndication Feed

Upgrade your NetApp DataONTAP! (7.3.3P2/P4)

 

Are you a NetApp customer?

Do you know which version of DataONTAP your storage system(s) are running?

 

 

As of today, we are running 7.3.1.1P8. This system had 337 days of uptime. I was anxiously awaiting my 1 year milestone, as some of you may have seen me posting about it on Twitter. Some of the NetApp crowd was also involved in sharing/bragging about this. I also did not want to update it prior to being gone nearly the entire month of September for VMworld and Oracle OpenWorld.

 

 

Enter BUG #332110: Driver refresh for X1008, X1010 dual port 10G ethernet card

 

 

***If you're a NetApp customer, you can click here to see the full report

 

 

Driver refresh from vendor to fix known problems in both hardware and software.

The vendor found a problem with the Media Access Control (MAC) in the T3B2 (not the T3C) revision of the chip. Only the X1107 uses T3C in NetApp.

When the MAC is under high load, it can get into a mode where it does nothing except transmit pause frames. It will not even forward received traffic up to the host. Only rebooting the filer can get the MAC out of that mode. The driver refresh includes a workaround that detects and resets the MAC portion of the chip at run time.

Internal MAC flow control is enabled in the refreshed driver; a portion of it was ported to the old driver to fix bug 313558.

Support for the X1106/X1107 is added in this driver.

 

 

So begins our story...

 

 

Friday night (technically Saturday morning), I get a call around 3AM from a DBA who says, "Hey, Oracle can't access its NFS mounts..."

 

 

*sigh* "OK, let me check..."

 

 

Dig a little deeper, call our Network Engineer, and then see that ALL VMs are disconnected. Oh dear.

 

 

Dig a little deeper, and see that NO 10GbE traffic is passing to the NetApp. Oh dear.

 

 

We had hit the bug. And the really bad part? It didn't actually take the filer or the interfaces down/offline, so cluster failover didn't take place.

 

 

What did this do for us? OPENED OUR EYES! We need to keep up with patches/upgrades better. We need a physical Domain Controller in place, because when the filer tried to come back up, our virtual domain controllers were still offline, and therefore, the authentication part of the "giveback" process failed.

 

 

Sidenote: "Dear NetApp, please give us some flexibility on the Active Directory integration. I've got dozens of Domain Controllers all around the country, but the installation only looks in the local site/subnet where the filer resides. Had it traversed outside, it would have found MANY online DCs."

 

 

We also need more granular monitoring of latency to our storage systems, because technically, the storage never went down.
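One cheap way to catch this class of failure, where the storage reports "up" but serves nothing, is an end-to-end probe that times a small synchronous write to the NFS mount and alerts when latency crosses a threshold. A minimal sketch (the probe path and the 500 ms threshold are assumptions for illustration, not values from our environment):

```python
import os
import time

def probe_write_latency(path, size=4096):
    """Time one small fsync'd write to `path`; returns latency in seconds.

    If the filer is transmitting only pause frames, this call stalls even
    though the interfaces still report link-up, so a latency threshold
    catches the failure that a simple up/down check misses."""
    buf = b"\0" * size
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, buf)
        os.fsync(fd)  # force the write through to the server
    finally:
        os.close(fd)
        os.unlink(path)  # clean up the probe file
    return time.monotonic() - start
```

Run it from cron against a throwaway file on the NFS mount and page when, say, `probe_write_latency("/mnt/oracle_nfs/.probe") > 0.5` (hypothetical path and threshold).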

 

 

Lastly, I had made some configuration changes to our VIF configuration over the course of 337 days of uptime. What I didn't realize is that those changes were never written to the RC file, which is what initializes and creates all of your VIFs on boot. So, we came back up on a single interface. Essentially, our entire company is running on a single 10GbE interface now. We made the decision to get the company back up and online rather than troubleshoot/reconfigure everything, and to deal with this in a later planned downtime.
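On Data ONTAP 7-mode, VIF and interface changes made at the console take effect immediately but vanish at boot unless they are also written into /etc/rc. A quick sanity check looks something like this (the VIF name and addressing below are examples, not this filer's real configuration):

```
# Show the running VIF configuration
filer> vif status

# Show what will actually be recreated at boot
filer> rdfile /etc/rc

# If they differ, the missing lines need to go into /etc/rc, e.g.:
#   vif create lacp vif0 -b ip e1a e1b
#   ifconfig vif0 10.0.0.10 netmask 255.255.255.0 up
```

Diffing `vif status` against `/etc/rc` after every change is the habit that would have saved us here.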

 

 

Tomorrow night, we have to take an additional 2-4 hour planned emergency outage to correct all of these things. Why? Because I didn't stay on top of my upgrades, and I didn't stay on top of my configs.

 

 

Lesson learned? You bet.

 

 

If you're using NetApp storage with any of the bleeding edge stuff like 10GbE/FCoE expansion cards, keep your stuff up to date. It is bound to have bugs; all software does.

 

 

-Nick

 

 

