Learn some helpful IT Administrator tips and tricks.

Welcome to the most comprehensive list of tips and tricks for the IT field that you'll find anywhere on the internet. I hope these tips help you get the most out of your internet.


Monday, April 23, 2012

Solving the Problem

       Many device or network problems are straightforward to resolve, but others yield misleading symptoms. If one solution does not work, continue with another.
A solution often involves:
  • Upgrading software or hardware (for example, upgrading to a new version of agent software or installing Gigabit Ethernet devices)
  • Balancing your network load by analyzing:
    • What users communicate with which servers
    • What the user traffic levels are in different segments
Based on these findings, you can decide how to redistribute network traffic.
  • Adding segments to your LAN (for example, adding a new switch where utilization is continually high)
  • Replacing faulty equipment (for example, replacing a module that has port problems or replacing a network card that has a faulty jabber protection mechanism)
To help solve problems, have available:
  • Spare hardware equipment (such as modules and power supplies), especially for your critical devices
  • A recent backup of your device configurations to reload if flash memory gets corrupted (which can sometimes happen due to a power outage)
Why do we investigate incidents? The key purposes of an investigation should be to:
     - prevent a future recurrence of the incident
     - determine the root cause to prevent similar losses at the same or another location
     - satisfy legal and company requirements and determine the company's liability
     - benefit from lessons learned, which may result in improved safety and operation
     - keep employees informed about the event and the follow-up action

Sunday, April 22, 2012

Identifying and Testing the Cause of the Problem

      After you develop a theory about the cause of the problem, test your theory. The test must conclusively prove or disprove your theory.

Two general rules of troubleshooting are:

  • If you cannot reproduce a problem, then no problem exists unless it happens again on its own.
  • If the problem is intermittent and you cannot replicate it, you can configure your network management software to catch the event in progress.
      For example, with "LANsentry Manager", you can set alarms and automatic packet capture filters to monitor your network and inform you when the problem occurs again. See "Configuring Transcend NCS" for more information.

      Although network management tools can provide a great deal of information about problems and their general location, you may still need to swap equipment or replace components of your network until you locate the exact trouble spot.

      After you test your theory, either fix the problem as described in "Solving the Problem" or develop another theory.

Sample Problem Analysis
       This section illustrates the analysis phase of a typical troubleshooting incident. On your network, a user cannot access the mail server. You need to establish two areas of information:
  • What you know - In this case, the user's workstation cannot communicate with the mail server.
  • What you do not know and need to test -
    • Can the workstation communicate with the network at all, or is the problem limited to communication with the server? Test by sending a "Ping" or by connecting to other devices.
    • Is the workstation the only device that is unable to communicate with the server, or do other workstations have the same problem? Test connectivity at other workstations.
    • If other workstations cannot communicate with the server, can they communicate with other network devices? Again, test the connectivity.
The analysis process follows these steps:
  1. Can the workstation communicate with any other device on the subnetwork?
    • If no, then go to step 2.
    • If yes, determine if only the server is unreachable.
    • If only the server cannot be reached, this suggests a server problem. Confirm by doing step 2.
    • If other devices cannot be reached, this suggests a connectivity problem in the network. Confirm by doing step 3.
  2. Can other workstations communicate with the server?
    • If no, then most likely it is a server problem. Go to step 3.
    • If yes, then the problem is that the workstation is not communicating with the subnetwork. (This situation can be caused by workstation issues or a network issue with that specific station.)
  3. Can other workstations communicate with other network devices?
    • If no, then the problem is likely a network problem.
    • If yes, the problem is likely a server problem.
When you determine whether the problem is with the server, subnetwork, or workstation, you can further analyze the problem, as follows:
  • For a problem with the server - Examine whether the server is running, if it is properly connected to the network, and if it is configured appropriately.
  • For a problem with the subnetwork - Examine any device on the path between the users and the server.
  • For a problem with the workstation - Examine whether the workstation can access other network resources and if it is configured to communicate with that particular server.
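
The three-step analysis above can be sketched as a small decision function. This is only an illustration: the Observations fields and the diagnose function are hypothetical names, and the answers would come from your own ping or connectivity tests. The code mirrors the steps literally and does not add any diagnostic rules of its own.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    """Answers gathered from connectivity tests (all field names are hypothetical)."""
    workstation_reaches_other_devices: bool  # step 1
    only_server_unreachable: bool            # step 1 follow-up, used only if step 1 is "yes"
    others_reach_server: bool                # step 2
    others_reach_other_devices: bool         # step 3


def step3(obs: Observations) -> str:
    # Step 3: can other workstations communicate with other network devices?
    return "server" if obs.others_reach_other_devices else "network"


def step2(obs: Observations) -> str:
    # Step 2: can other workstations communicate with the server?
    if obs.others_reach_server:
        return "workstation"
    return step3(obs)  # "most likely a server problem, go to step 3"


def diagnose(obs: Observations) -> str:
    """Return 'server', 'network', or 'workstation' by walking steps 1 to 3."""
    if not obs.workstation_reaches_other_devices:
        return step2(obs)
    if obs.only_server_unreachable:
        return step2(obs)  # suspected server problem, confirm with step 2
    return step3(obs)      # suspected connectivity problem, confirm with step 3


if __name__ == "__main__":
    print(diagnose(Observations(True, True, False, True)))
```

The result only tells you where to look next (server, subnetwork, or workstation); the detailed examination still follows the bullets above.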

Equipment for Testing
To help identify and test the cause of problems, have available:
  • A laptop computer that is loaded with a terminal emulator, TCP/IP stack, TFTP server, CD-ROM drive (to read the online documentation), and some key network management applications, such as LANsentry Manager. With the laptop computer, you can plug into any subnetwork to gather and analyze data about the segment.
  • A spare managed hub to swap for any hub that does not have management. Swapping in a managed hub allows you to quickly spot which port is generating the errors.
  • A single port probe to insert in the network if you are having a problem where you do not have management capability.
  • Console cables for each type of connector, labeled and stored in a secure place.

Understanding the Problem

      Networks are designed to move data from a transmitting device to a receiving device. When communication becomes problematic, you must determine why data are not traveling as expected and then find a solution. The two most common causes for data not moving reliably from source to destination are:
  • The physical connection breaks (that is, a cable is unplugged or broken).
  • A network device is not working properly and cannot send or receive some or all data.
       Network management software can easily locate and report a physical connection break (layer 1 problem). It is more difficult to determine why a network device is not working as expected, which is often related to a layer 2 or a layer 3 problem.

To determine why a network device is not working properly, look first for:
  •  Valid service - Is the device configured properly for the type of service it is supposed to provide? For example, has Quality of Service (QoS), which is the definition of the transmission parameters, been established?
  • Restricted access - Is an end station supposed to be able to connect with a specific device or is that connection restricted? For example, is a firewall set up that prevents that device from accessing certain network resources?
  • Correct configuration - Is there a misconfiguration of IP address, subnet mask, gateway, or broadcast address? Network problems are commonly caused by misconfiguration of newly connected or configured devices.
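
As a rough illustration of the "correct configuration" check, the sketch below uses Python's standard ipaddress module to test whether an IPv4 address, subnet mask, gateway, and broadcast address are consistent with each other. The function name check_ip_config and the sample addresses are assumptions made for this example, and the check is meant for ordinary LAN subnets, not point-to-point /31 or /32 links.

```python
import ipaddress


def check_ip_config(address: str, netmask: str, gateway: str, broadcast: str) -> list[str]:
    """Return a list of problems found; an empty list means none were detected."""
    try:
        interface = ipaddress.IPv4Interface(f"{address}/{netmask}")
        gateway_ip = ipaddress.IPv4Address(gateway)
        broadcast_ip = ipaddress.IPv4Address(broadcast)
    except ValueError as exc:
        # AddressValueError and NetmaskValueError are both ValueError subclasses.
        return [f"unparseable value: {exc}"]

    network = interface.network
    problems = []
    if interface.ip in (network.network_address, network.broadcast_address):
        problems.append(f"{interface.ip} is the network or broadcast address of {network}")
    if gateway_ip not in network:
        problems.append(f"gateway {gateway_ip} is outside {network}")
    if broadcast_ip != network.broadcast_address:
        problems.append(f"broadcast should be {network.broadcast_address}, not {broadcast_ip}")
    return problems


if __name__ == "__main__":
    # Sample values only: the gateway is deliberately on the wrong subnet.
    for problem in check_ip_config("192.168.1.10", "255.255.255.0",
                                   "192.168.2.1", "192.168.1.255"):
        print(problem)
```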

Saturday, April 21, 2012

Recognizing Symptoms

       The first step to resolving any problem is to identify and interpret the symptoms. You may discover network problems in several ways. Users may complain that the network seems slow or that they cannot connect to a server. You may pass your network management station and notice that a node icon is red. Your beeper may go off and display the message: WAN connection down.

User Comments
       Although you can often solve networking problems before users notice a change in their environment, you invariably get feedback from your users about how the network is running, such as:
  • They cannot print.
  • They cannot access the application server.
  • It takes them much longer to copy files across the network than it usually does.
  • They cannot log on to a remote server.
  • When they send e-mail to another site, they get a routing error message.
  • Their system freezes whenever they try to Telnet.
Network Management Software Alerts
      Network management software, as described in "Your Network Troubleshooting Toolbox", can alert you to areas of your network that need attention. For example:
  • The application displays red (Warning) icons.
  • Your weekly Top-N utilization report (which indicates the 10 ports with the highest utilization rates) shows that one port is experiencing much higher utilization levels than normal.
  • You receive an e-mail message from your network management station that the threshold for broadcast and multicast packets has been exceeded.
       These signs usually provide additional information about the problem, allowing you to focus on the right area.
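
To make the Top-N report and threshold alert ideas concrete, here is a minimal sketch that works on a plain dictionary of port names and utilization percentages. The function names, port names, and numbers are invented for illustration; a real network management station would collect these figures for you.

```python
def top_n_ports(utilization: dict[str, float], n: int = 10) -> list[tuple[str, float]]:
    """Return the n ports with the highest utilization, busiest first."""
    return sorted(utilization.items(), key=lambda item: item[1], reverse=True)[:n]


def ports_over_threshold(utilization: dict[str, float], threshold_percent: float) -> list[str]:
    """Return the names of ports whose utilization exceeds the threshold."""
    return sorted(port for port, pct in utilization.items() if pct > threshold_percent)


if __name__ == "__main__":
    # Hypothetical sample data (percent utilization per port).
    sample = {"sw1/1": 12.0, "sw1/2": 91.5, "sw1/3": 40.0}
    print(top_n_ports(sample, n=2))
    print(ports_over_threshold(sample, threshold_percent=80.0))
```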

Analyzing Symptoms
      When a symptom occurs, ask yourself these types of questions to narrow the location of the problem and to get more data for analysis:
  • To what degree is the network not acting normally (for example, does it now take one minute to perform a task that normally takes five seconds)?
  • On what subnetwork is the user located?
  • Is the user trying to reach a server, end station, or printer on the same subnetwork or on a different subnetwork?
  • Are many users complaining that the network is operating slowly or that a specific network application is operating slowly?
  • Are many users reporting network logon failures?
  • Are the problems intermittent? For example, some files may print with no problems, while other printing attempts generate error messages, make users lose their connections, and cause systems to freeze.

Troubleshooting Strategy

How do you know when you are having a network problem? The answer to this question depends on your site's network configuration and on your network's normal behavior. See "Knowing Your Network" for more information.
If you notice changes on your network, ask the following questions:
  • Is the change expected or unusual?
  • Has this event ever occurred before?
  • Does the change involve a device or network path for which you already have a backup solution in place?
  • Does the change interfere with vital network operations?
  • Does the change affect one or many devices or network paths?
       After you have an idea of how the change is affecting your network, you can categorize it as critical or noncritical. Both of these categories need resolution (except for changes that are one-time occurrences); the difference between the categories is the time that you have to fix the problem.

       By using a strategy for network troubleshooting, you can approach a problem methodically and resolve it with minimal disruption to network users. It is also important to have an accurate and detailed map of your current network environment. Beyond that, a good approach to problem resolution is:
  • Recognizing symptoms
  • Understanding the problem
  • Identifying and testing the cause of the problem
  • Solving the problem

Friday, April 20, 2012

Best Practices for Change Management Process

This post provides best practices for the change management process. Change requests for Mission Critical or Significant applications, systems that contain or access High Integrity or Very High Integrity data, systems that contain or access Classified or Confidential-Restricted Access information, and infrastructure components that support these applications and systems must be documented via a Reporting Unit approved change request form. That change request form requires approval by the Information Steward or Delegate. The change request form must, at a minimum, contain the following information:
  • Who is initiating the change 
  • Who is responsible for implementing the change 
  • Who is responsible for the approval 
  • Business justification for the change 
  • Nature of defect (if applicable) 
  • Testing required and who is responsible for the testing 
  • Back-out procedures 
  • Systems impacted 
  • User contact
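
One way to enforce the minimum contents of the form is to check a request for empty fields before it goes for approval. The sketch below is an assumption about how that might look. The ChangeRequest class and its field names are made up for this example and simply follow the list above, with "nature of defect" treated as optional because it only applies to some changes.

```python
from dataclasses import dataclass, fields


@dataclass
class ChangeRequest:
    """Hypothetical change request record; fields follow the minimum list above."""
    initiator: str
    implementer: str
    approver: str
    business_justification: str
    testing_required: str
    tester: str
    back_out_procedures: str
    systems_impacted: str
    user_contact: str
    nature_of_defect: str = ""  # optional: "if applicable"


OPTIONAL_FIELDS = {"nature_of_defect"}


def missing_fields(request: ChangeRequest) -> list[str]:
    """Return the names of required fields that are empty or whitespace only."""
    return [
        f.name
        for f in fields(request)
        if f.name not in OPTIONAL_FIELDS and not getattr(request, f.name).strip()
    ]


if __name__ == "__main__":
    request = ChangeRequest(
        initiator="J. Doe", implementer="Server team", approver="Information Steward",
        business_justification="Fix report defect", testing_required="UAT",
        tester="Key users", back_out_procedures="", systems_impacted="Reporting",
        user_contact="J. Doe",
    )
    print(missing_fields(request))
```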

Applies to
  • Information Steward: approver
  • Systems Administrator: logger
  • Developer: programmer
  • Publisher: publishes to the production environment
  • Users: User Acceptance testing
  • IP Coordinator: communicates the process to regional/local IP communities

Thursday, April 19, 2012

How to Prioritize Incidents

What is an incident? An incident is a system bug or error, user question, or routine administration request.
Defect Categories Defined

  • High - Incident of highest relative urgency. Essential Suite may be severely impacted and end users require immediate assistance. The situation meets one or more of the following criteria:
          1. Any issue that significantly increases the likelihood of a safety or environmental incident occurring and/or the consequence of that potential event.
          2. A Mission Critical business process is impacted and no workaround exists.
          3. Impacts 100 users or more.
          4. Work is totally stopped.
          5. System is down completely.
  • Medium - Significant problem for the end user that may result in financial or other serious impact for Essential Suite. The situation may become high priority if not quickly addressed. The situation is not high, but meets one or more of the following criteria:
          1. A significant business process is impacted but a workaround exists.
          2. Impacts 50 to 99 users.
          3. Significant loss of work capacity, but can get some work done.
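
The High and Medium criteria above can be expressed as a simple rule check. The following is a sketch, not an official scoring tool: the Incident fields are hypothetical names for the criteria, and returning "Low" for everything that matches neither list is an assumption, since the category definitions above only spell out High and Medium.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Facts about an incident (field names are hypothetical)."""
    users_impacted: int = 0
    safety_or_environmental_risk: bool = False
    mission_critical_process_impacted: bool = False
    significant_process_impacted: bool = False
    workaround_exists: bool = False
    work_totally_stopped: bool = False
    system_down: bool = False
    significant_capacity_loss: bool = False


def priority(incident: Incident) -> str:
    """Return 'High', 'Medium', or 'Low' (the 'Low' fallback is an assumption)."""
    if (
        incident.safety_or_environmental_risk
        or (incident.mission_critical_process_impacted and not incident.workaround_exists)
        or incident.users_impacted >= 100
        or incident.work_totally_stopped
        or incident.system_down
    ):
        return "High"
    if (
        (incident.significant_process_impacted and incident.workaround_exists)
        or 50 <= incident.users_impacted <= 99
        or incident.significant_capacity_loss
    ):
        return "Medium"
    return "Low"


if __name__ == "__main__":
    print(priority(Incident(users_impacted=60)))
```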
Incident Classification
We classify incidents based on the scenarios defined below:
  • High – System down related issues
  • Medium – User has classified it as moderate priority based on criteria, access related issue, etc.
  • Low – Updating records in system, Scheduling report, Data Mining, Close action items issue, Troubleshooting issues
Times                             High          Medium              Low
Initial Response Time             <= 2 Hours    <= 24 Hours         <= 2 Business Days
Restoration Time for an incident  <= 24 Hours   <= 2 Business Days  <= 5 Business Days
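
To turn the response and restoration times into actual deadlines on a ticket, something like the sketch below could be used. It makes two assumptions that the table does not state: targets given in hours are counted as calendar hours, and business days are Monday to Friday with no holiday calendar. The names SLA, add_business_days, and deadline are illustrative only.

```python
from datetime import datetime, timedelta

# Targets from the table above: (unit, amount) per priority and target type.
SLA = {
    "High": {"response": ("hours", 2), "restoration": ("hours", 24)},
    "Medium": {"response": ("hours", 24), "restoration": ("business_days", 2)},
    "Low": {"response": ("business_days", 2), "restoration": ("business_days", 5)},
}


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by whole days, counting only Monday to Friday (no holidays)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            remaining -= 1
    return current


def deadline(opened: datetime, priority: str, target: str) -> datetime:
    """Return the latest time by which the given target should be met."""
    unit, amount = SLA[priority][target]
    if unit == "hours":
        return opened + timedelta(hours=amount)
    return add_business_days(opened, amount)


if __name__ == "__main__":
    opened = datetime(2012, 4, 20, 9, 0)  # a Friday morning
    print(deadline(opened, "Low", "response"))
```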

Wednesday, April 18, 2012

Best Practice to contact an end user

       This post describes a best practice for contacting an end user when a trouble ticket is open and assigned to the analyst. An open ticket suggests an obligation on the part of the customer to be available to troubleshoot the issue. We must give the customer every opportunity to be available for that troubleshooting, within reason.

Tips and Pointers
  • Try multiple methods of contact to ensure that every effort has been made to connect with the customer.
  • You can include the final closing email as a saved .msg file in Attachments if you wish to preserve formatting.
  • Try to keep contact efforts to 24-hour increments so that the ticket does not linger overly long.
  • Be polite and thorough in your communications and documentation.
  • Just a reminder: management is collecting metrics on the length of time that tickets remain open, especially tickets with no activity, so keep your documentation up to date.
  • DOCUMENT ALL ATTEMPTS!
Example Email Template
Template 1
Subject:  PC Service Request <<case#>> Action Required
Hello,
I am with <<location>> IT End User Support and have received your PC Service Request, case # <<case#>>.  I would like to help you with <<Short description of PC problem>>, but have not been able to get in touch with you by phone or email. Please let me know your availability at your earliest convenience.
Thank you,
<<Analysts Name>>

Template 2
Subject:  2nd Attempt. PC Service Request <<case#>> Action Required
Hello,
I am with <<location>> IT End User Support and have received your PC Service Request, case # <<case#>>.  I would like to help you with <<Short description of PC problem>>, but have not been able to get in touch with you by phone or email. Please let me know your availability at your earliest convenience.
Thank you,
<<Analysts Name>>

Template 3
Subject:  3rd Attempt.  PC Service Request <<case#>> Action Required
Hello,
I am with <<location>> IT End User Support and have received your PC Service Request, case # <<case#>>.  I would like to help you with <<Short description of PC problem>>, but have not been able to get in touch with you by phone or email.
This is our third attempt to contact you.  I will close this case unless I hear from you by end of business today.  If you still need assistance with the same issue, please call the Help Desk at 000 000 (or 000-000-0000 after hours) and have your current ticket reopened within 20 days.
Thank you,
<<Analyst name>>


** Please replace each << >> placeholder with the specific information named inside the brackets.
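
If you send these emails often, the << >> placeholders can be filled in by a small script instead of by hand. The sketch below is one possible approach, not part of any ticketing tool: fill_template is a made-up helper, and the case number and location in the example are sample values.

```python
import re


def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace every <<name>> placeholder with values[name].

    Raises KeyError if a placeholder has no value, so that a half-filled
    email is never sent by accident.
    """
    def substitute(match: re.Match) -> str:
        key = match.group(1).strip()
        if key not in values:
            raise KeyError(f"no value supplied for placeholder <<{key}>>")
        return values[key]

    return re.sub(r"<<(.+?)>>", substitute, template)


if __name__ == "__main__":
    subject = "PC Service Request <<case#>> Action Required"
    # Sample values only.
    print(fill_template(subject, {"case#": "12345"}))
```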

Monday, April 16, 2012

Network Recovery Strategy ~ BCP

       This post shows an example network recovery strategy (the network part of a BCP) that can be applied to your business. The network and operations facilities at the data center provide business application systems for your business and include:
  • Core Network Services: (Exchange Email, File and Print Services, Internet / Intranet)
  • Business Applications: (ERP, Financials, A/R, A/P, AM, billing), Mainframe printing.
  • User Workstations and associated application software for approximately 150 users in Marketing, Finance, Staff / IT etc.
Disaster Classification:
  • Level 1: Temporary (less than 7 days). Loss of power / water to your building. This would require the shutdown of the computer room and servers, but loss of equipment or data would be minimal.
  • Level 2: Significant (greater than 7 days). The building cannot be occupied (fire/water damage, disease, other threat), but city infrastructure is intact.
  • Level 3: Significant widespread damage to the city infrastructure (earthquake). Many core services are unavailable; employees are unable to report to work, etc.
The IT recovery strategy is primarily designed to respond to level 1 or level 2 disasters.  Level 3 disasters are within the scope of your business resumption plan where the primary focus is on ensuring the safety and security of employees and company assets and providing disaster assistance to the community.
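
As a rough illustration of the classification, the sketch below maps an expected outage length and the state of the city infrastructure to a disaster level, and reports whether the IT recovery strategy covers it. The function names are invented for this example, and treating an outage of exactly 7 days as Level 2 is an assumption, because the definitions above only say "less than" and "greater than" 7 days.

```python
def disaster_level(expected_outage_days: float, city_infrastructure_intact: bool) -> int:
    """Return 1, 2, or 3 following the classification above."""
    if not city_infrastructure_intact:
        return 3
    # Assumption: exactly 7 days is treated as Level 2.
    return 1 if expected_outage_days < 7 else 2


def in_it_recovery_scope(level: int) -> bool:
    """The IT recovery strategy is primarily designed for Level 1 and Level 2."""
    return level in (1, 2)


if __name__ == "__main__":
    level = disaster_level(expected_outage_days=14, city_infrastructure_intact=True)
    print(level, in_it_recovery_scope(level))
```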

Key Assumptions Example:
  • The primary recovery site for your site is the …
  • The backup site facilities have the minimum network and hardware components required to establish basic network operations.  The alternate site emergency response facility and equipment (laptops/printers) are available for use.
  • A portion of the backup office facilities and equipment (workstations / printers) are available for your users.
  • The recovery of business application systems (ERP etc.) would require sourcing appropriate hardware (via hardware vendor)
  • All recovery documentation and required backup tapes are available offsite.
  • Current IT staff is available to perform recovery processes.  Additional resources are available from your company's other locations.
Recovery Phases:
Phase 0 - Day 1
     Disaster
     Ensure safety of employees
     Notification / communication / Formal Disaster Declaration
     Assembly of recovery team / roles
     Assessment of impact, stability of recovery facility, and recovery timeframe
     Determine Recovery Strategy
     Order recovery tapes
Phase 1 - Day 2-4
     Recover Core Network Components:
        • Exchange Server
        • File Servers
        • WAN connectivity
Phase 2 - Day 5-10
     Recover Core Business Applications:
        • ERP
        • Mainframe Printing
Phase 3 - Day 10-30
     Complete recovery of all systems or reactivation of data center

Notification and Declaration of a Disaster
       The first and foremost objective when a disaster happens is to ensure the safety of all staff. This takes precedence over any recovery activities.
       During a recovery process, recovery personnel must take appropriate and adequate rest breaks and use safety controls to ensure their personal safety.  The maximum length of a recovery shift is 12 hours and includes periodic rest breaks.

       The primary responsibility for declaring an IT/Network disaster and invoking the disaster recovery plan rests with the Manager of Information Technology.   Secondary responsibility rests with the Network Team Lead and Office and Information Services Team Leads in consultation with your company Leadership Team.  Specific responsibilities are:
  • Communicate disaster to your company Leadership team – what happened, why, when, initial assessment and recovery overview.
  • Identify and contact the IT recovery teams, recovery team leads, and a recovery coordinator.  The recovery teams will be created from the existing IT organization based on who is available.  For a disaster requiring recovery to an alternate site, or where the recovery time is likely to exceed 12 hours, at least 2 teams should be created.
  • Facilitate the assessment of the disaster and development of a recovery plan.
  • Ongoing communication to your company Leadership team, management and employees as appropriate.  

Sunday, April 15, 2012

GUIDELINES FOR SITES WITH < 512K AVAILABLE DATA-ONLY TO DC PROMO

      I would like to share guidelines for sites that have less than 512K of available data-only connection and propose to DC promo their local DCs. For sites that plan to install and promote (locally) an AD domain controller, a 512K available data-only connection is the strong recommendation, and it is the only connection I am quite confident will succeed without requiring additional support and effort. If the site has a link lower than that, additional research needs to be done to ensure a smooth promotion and to minimize adverse business impact.

      The recommendation that we feel comfortable with is 512K data-only. Anything under that, we may be able to try, but unfortunately we cannot guarantee anything. So it is up to the site to determine whether it wants to take on that risk. For example, if a site's circuit is going to be upgraded later anyway, the site may want to wait, unless there is a business case that indicates the site cannot wait.
Considerations:
       Those at the site representing the business must understand and agree to the additional risk, should they elect to promote a DC over a <512K available data-only link. That risk includes:
  • Very slow connections to external resources, including applications, internet, etc., for a week or more. They should expect 3 weeks.
  • A successful promotion at a similar site does not ensure that other similar sites will be successful, because every site is different and the databases grow daily.
Here are the rules:
       If a site MUST DC promo over a shared link or over a link smaller than 512K available data-only ("available" meaning the part of the link that is not dedicated to some other data stream such as mail or ERP), the Server Infrastructure Team needs to talk to the site to gather important information.
  • It is critical that the design teams understand what is going over the connection, so we can make an informed decision.
  • When promotions fail, it causes rework and could delay other sites. The rework also always introduces some small amount of risk that something inadvertently corrupts the rest of the forest.
  • Once we all understand what is going over the <512K link, the customer agrees to the risk, and the Server Infrastructure Team design team feels it will not adversely impact others, we can OK the attempt at a DC promotion. Again, the site will be expected to significantly reduce any other traffic going over the link during the promotion, and also during the SMS build. Traffic that should be taken into consideration and significantly curtailed includes:
          • voice traffic
          • external application traffic (like intranet, ERP)
          • Internet traffic
  • Promotions must start the Friday just before a weekend to ensure the best throughput.
  • Whenever possible, promote the DC at a well-connected site and then ship it to the other site.
  • We cannot make a lot of special allowances trying to make it work. If it promotes, it promotes, and if it doesn't, it doesn't. In many cases we can try it, if the customer is willing to take on the risk. Once all servers come up on site, we cannot be certain there will be no performance degradation when servers try to sync, etc. Again, there are too many variables.
      We need to address and agree to the plan for <512K available data-only sites well in advance of their planned DC promotions.  That way, things can run as smoothly as possible.
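
To get a feel for why link size matters, it helps to do the raw arithmetic of pushing a given amount of data over a slow link. The sketch below is only a back-of-the-envelope estimate under stated assumptions: "512K" is read as 512 kilobits per second, the data size is a made-up figure (this post does not give one), and the efficiency factor is a guess at protocol overhead. It does not model how Active Directory replication actually behaves.

```python
def transfer_hours(data_megabytes: float, link_kbps: float,
                   available_fraction: float = 1.0, efficiency: float = 0.8) -> float:
    """Estimate hours needed to move data_megabytes over a link.

    available_fraction: share of the link not used by other traffic (0 to 1].
    efficiency: assumed usable share after protocol overhead (0 to 1].
    """
    if link_kbps <= 0 or not 0 < available_fraction <= 1 or not 0 < efficiency <= 1:
        raise ValueError("link speed and fractions must be positive, fractions at most 1")
    usable_bits_per_second = link_kbps * 1000 * available_fraction * efficiency
    total_bits = data_megabytes * 1_000_000 * 8
    return total_bits / usable_bits_per_second / 3600


if __name__ == "__main__":
    # Hypothetical 2000 MB of data; compare a full 512K link with a half-available 256K link.
    print(round(transfer_hours(2000, 512), 1))
    print(round(transfer_hours(2000, 256, available_fraction=0.5), 1))
```

Halving the link speed or the available share doubles the estimate, which is why curtailing other traffic during the promotion matters so much on small links.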

Application Service Provider Checklist Examples

       The purpose of "Application Service Provider Checklist" is to obtain background information for those external vendors (3rd parties) that are currently providing or plan to provide external application hosting services for your business.
Items Service Provider Response
A1 Provide the name of the Application Service Provider (Outsourcer) and business address.
A2 Provide the name of the application to be hosted at the provider’s location.
A3 How long have you performed as or provided Application Service Provider (ASP) hosting services?
A4 How many applications do you provide hosting services for?
A5 How many customers do you currently support?
How many customers do you support for the application your company is interested in (if you host more than one application)?
A6 Do you provide both shared and dedicated infrastructure (application, database, O/S) hosting options?
a. How many customers utilize your shared infrastructure?
b. How many customers utilize your dedicated infrastructure?
c. Do you have separate database instances for your customers or do they share the same database?
d. Is the application, web, and database on separate servers?
e. How many application servers are used to host the application?
f. How many database servers are used to host the database?
g. For web-based environments, is the web server installed on the same server as the application? If no, how many web servers are used to support the application?
A7 What IT governance or security framework do you use for your control environment (COBIT, ISO 17799, ISO 27002, internal policies and standards, etc.)?
A8 Do you have an internal and/or external audit function?
A9 Have you contracted with a 3rd party to provide an attestation of your control environment (e.g. SAS70 certified, BITS)?
a. Please indicate the name and how often it is performed.
b. Note: for SAS70, please indicate Type I or Type II.
A10 Has or will a major acquisition (merger) occur in the next 6-12 months?
A11 What is your core business (expertise)?
Items Network Response
B1 Describe all end-to-end encryption methods currently supported (e.g. SSL, HTTPS, VPN, IPSEC, SFTP) to securely transport data between you and your customers. Include the strength of the cipher (e.g. 128 bit).
B2 Describe all email encryption methods you currently support (e.g. TLS, PGP, etc.).
B3 Are strong authentication measures (e.g. two-factor authentication using RSA tokens or smartcards) used for remote access to your network or for remote administration of network devices (e.g. firewalls, routers, switches, IDS, etc.)?
Note: userid/password is single factor.
B4 Is redundancy and/or failover employed for critical devices such as firewalls, servers, load balancers, etc.?  Please provide detail.
B5 Are intrusion detection or intrusion prevention systems used?
a. Network Based – where deployed
b. Host Based – where deployed
c. Application based – where deployed
B6 Please provide information about vulnerability assessments performed for your environment:
a. List the types of assessments performed (penetration tests, network vulnerability scanning, etc.).
b. Describe the scope of the assessments (network perimeter, application assessment, etc.).
c. How often are they performed?
d. Are they performed by internal staff or external parties?
Items Operations   Response
C1 Where is the primary processing facility (data center) located?
C2 Are any functions outsourced to a 3rd party (e.g. application development, system or network admin, data center)?  Please describe.
C3 Is access to the datacenter where the IT infrastructure resides controlled by you or by a 3rd party?
C4 Describe your process for keeping abreast of security threats for network devices, database, and operating system components?
C5 Do you have procedures in place for incident response, escalation and investigation?
C6 Is a formal change control process used to manage and track customer change requests and changes to the application, database, network and operating system components?
C7 Are security threats (events) for the application, database and operating system logged and reviewed regularly?  How often?
C8 Do you have separate development, test and production environments?
C9 Does the application reside in the same domain as the applications used to support your business?
Do the application and its components reside on a separate VLAN from other applications?
C10 Is user access to the application controlled by the customer or the Application Service Provider (e.g. add/remove users, password management, assign roles, etc.)?
Items Disaster Recovery Response
D1 Do you have a documented Business Continuity and Disaster Recovery Plan to address short term and long term disruptions of service?
D2 Are the plans reviewed and tested at least annually?
D3 Describe customer involvement in the annual testing.
D4 Where is your alternate processing facility located?
D5 Is the alternate processing facility a hot-site or cold-site?  If other please explain.
D6 What types of natural disasters are common in the region where the primary data center is located?

Saturday, April 14, 2012

New Infrastructure Systems

       This article contains the processes for New Infrastructure Systems.  The information contained within this document applies to sites that range from newly acquired sites to established sites.  These sites may range in size from small (0-300 PCs) and medium (301-500 PCs) to large (501-1000 PCs and up).
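
The size bands above are easy to encode if you track sites in a script or spreadsheet export. A minimal sketch, with site_size as a made-up helper name:

```python
def site_size(pc_count: int) -> str:
    """Classify a site by PC count using the bands described above."""
    if pc_count < 0:
        raise ValueError("pc_count cannot be negative")
    if pc_count <= 300:
        return "small"
    if pc_count <= 500:
        return "medium"
    return "large"  # 501 PCs and up


if __name__ == "__main__":
    for count in (120, 450, 800):
        print(count, site_size(count))
```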


Refer to this article if:

  • Your site is not currently on the planned namespace, and you plan to install your company workstations or servers. (That entire process is described here.)
  • You need to perform a subset of a new infrastructure deployment, such as:
        • Installing a new subnet for a site
        • Setting up DHCP at a site
        • Setting up DNS at a site
        • Installing Organizational Unit (OU) structure at a site
        • Setting up AD (Sites and Services ) for a site
        • Installing an AD Domain Controller at a site
        • Installing a software management services server at a site
Site Services Definitions:  Subnets, DNS, DHCP, OUs
Subnet Design

The network team, which makes router updates for all sites, should plan and design the following as required:
  • IP readdressing
  • Subnet mask
Note:  Plan this design a minimum of 15 weeks before the infrastructure needs to be in place (your scheduled IP readdress date or server/desktop deployment date).
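As a rough aid for the subnet mask decision, the sketch below uses Python's standard `ipaddress` module to find the smallest IPv4 subnet that holds a given number of hosts. The helper names and the base address `10.20.0.0` are hypothetical, and the sizing leaves no headroom for growth, servers, or printers, so treat the result as a starting point for the network team rather than a finished design.

```python
import ipaddress


def smallest_prefix_for_hosts(host_count: int) -> int:
    """Return the longest IPv4 prefix whose usable host count fits host_count."""
    if host_count < 1:
        raise ValueError("host_count must be at least 1")
    for prefix in range(30, 0, -1):
        # Two addresses per subnet are reserved (network and broadcast).
        usable = 2 ** (32 - prefix) - 2
        if usable >= host_count:
            return prefix
    raise ValueError("host_count does not fit in a single IPv4 subnet")


def plan_subnet(base_address: str, host_count: int) -> ipaddress.IPv4Network:
    """Build the network object for a hypothetical base address."""
    prefix = smallest_prefix_for_hosts(host_count)
    return ipaddress.ip_network(f"{base_address}/{prefix}", strict=False)


site = plan_subnet("10.20.0.0", 300)  # 300 PCs: upper end of a small site
print(site, site.netmask)  # 10.20.0.0/23 255.255.254.0
```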

New sites
For those deploying newly acquired sites, the following steps will be required:
  • Design new IP Addresses
  • Design DNS Entries
  • Subnet Design
  • Work on site design
  • Procure hardware and installation – Routers, rack, servers, space, power requirements
Procure Hardware
Determine Hardware Requirements
       Determine your site’s hardware requirements and place your order 15 weeks prior to Day 1 of deployments. Depending on your location and procurement processes, it can take anywhere between 2 weeks and 4 months to receive your hardware.
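The lead-time arithmetic can be checked with a few lines of Python. This is a minimal sketch: the function names and the example Day 1 date are hypothetical, and "4 months" is approximated as 120 calendar days.

```python
from datetime import date, timedelta

ORDER_LEAD_WEEKS = 15  # order lead time stated in this article


def order_deadline(day_one: date) -> date:
    """Latest date to place the hardware order for a given Day 1."""
    return day_one - timedelta(weeks=ORDER_LEAD_WEEKS)


def arrives_in_time(order_date: date, delivery_days: int, day_one: date) -> bool:
    """True if hardware ordered on order_date arrives on or before Day 1."""
    return order_date + timedelta(days=delivery_days) <= day_one


day_one = date(2012, 9, 3)  # hypothetical deployment start
deadline = order_deadline(day_one)
print(deadline)                                 # 2012-05-21
print(arrives_in_time(deadline, 14, day_one))   # True  (2-week delivery)
print(arrives_in_time(deadline, 120, day_one))  # False (about 4 months)
```

Note that the slowest delivery case (about 4 months) is longer than the 15-week lead time, so sites with slow procurement may need to order earlier than the deadline computed here.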

Server Builds
       The individual server teams can take control to finish the process, which takes 5-7 work days for each server. For example:
  • AD Server and DC Promo – DCs need 5-7 work days to be built (assumes the servers are already racked, turned on and “burning in the HW” for a minimum of 48 hours, and ready to begin DC Promo by day 1).  After building out the AD Server and promoting the DC, set up the trust between CT and the resource domain.
  • Software Management Services Server - SMS needs 5-7 work days (assumes the server is already racked, built to Brand, and turned on and “burning in the HW” for a minimum of 48 hours by day 1).  To avoid workstation deployment delays, do not plan on deploying workstations until a minimum of 10 work days after the SMS server is complete. The SMS and Exchange servers must wait for the DC server to be completed.  Exchange is also dependent on SMS being available.
  • DHCP/DNS Build (Create/Delegate DHCP Scope) – NS Servers need 5-7 work days for completion. 
  • Exchange Server - The Active Directory and SAN storage? will need to be in place prior to server installation. Mailboxes can be migrated as soon as servers and storage are in place.  Note that if Exchange will be at the site, then a DC must be at the site as well.
  • Print Server -  Build up to brand 8 weeks prior to PC Deployment
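The dependencies in the list above (DC first, then SMS, then workstations no sooner than 10 work days after SMS) can be turned into a rough timeline. The sketch below is an illustration under stated assumptions: it uses the upper bound of 7 work days per build, counts Monday to Friday as work days, ignores public holidays, and uses a hypothetical day 1.

```python
from datetime import date, timedelta


def add_work_days(start: date, work_days: int) -> date:
    """Return the date reached by counting work_days Mon-Fri days after start."""
    current = start
    remaining = work_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


BUILD_WORK_DAYS = 7       # upper bound of the 5-7 work day estimate
WORKSTATION_BUFFER = 10   # minimum work days after SMS is complete

day_one = date(2012, 4, 16)  # hypothetical day 1 (a Monday)
dc_done = add_work_days(day_one, BUILD_WORK_DAYS)
sms_done = add_work_days(dc_done, BUILD_WORK_DAYS)  # SMS waits for the DC
workstations_from = add_work_days(sms_done, WORKSTATION_BUFFER)

print(dc_done)            # 2012-04-25
print(sms_done)           # 2012-05-04
print(workstations_from)  # 2012-05-18
```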

Data Center Fire Sprinkler System

       This standard operating procedure template provides guidelines for operating, inspecting, and maintaining the data center fire sprinkler system at your site(s). This will be achieved by a workforce including, but not limited to, contractors, vendor partners and employees who consistently apply safe work practices. Safe work practices, including emergency procedures in the event of disasters, support Operational Excellence. A safe and secure environment helps ensure the health and well-being of all individuals (including workforce and visitors) and minimizes the impact of incidents that could affect business operations.

      The goal is to mitigate potential and preventable impacts to data center operations that could be caused by a fire going undetected, through the application of industry standards and current best practices, ensuring 100% operability and reliability when called upon for service.

      Workforce members should assist each other in following the guidelines.

Input(s)
  • The Facilities Team is responsible for maintaining an operational fire sprinkler system.
  • See contact list for appropriate contacts.
  • Any use of chemicals, cleaners, lubricants, etc. in support of data center operations requires that a material safety data sheet be submitted to and approved by the data center operations manager before the material is brought onto the site.
  • The workforce is responsible for proper housekeeping practices, including the storage of tools and equipment during and after the work, cleanup, and waste disposal.
  • All tools and equipment must be stored in a pre-approved area, or removed from the work site.
  • All regulatory and company safe work practices must be followed where applicable, including but not limited to personal protective equipment, safety barriers, lockout/tagout, and confined space entry.
  • Any member of the workforce may stop any work in progress due to unsafe work methods, conditions causing the area to be unsafe, emergency, or for any other data center operational necessity. All unsafe work methods will be reported to the data center operations manager who will investigate the situation and take corrective action.
  • Depending on organizational capability to perform the required work and/or maintenance, it may be necessary to contract with a qualified third party for some or all of the work described below. Note that many authorities having jurisdiction require that a licensed fire protection contractor perform at least some of the work on these systems.  For instance, sprinkler head inspections are typically done by facility staff, while the replacement of a pre-action valve is typically done by a licensed contractor. These requirements shall be verified for each site prior to implementing this standard operating procedure.
System Description
—Include Site System Information Here
  • The fire sprinkler system consists of a fire alarm panel, pump panel, pumps, piping, valves, and sprinkler heads.
  • Location—Indicate where the system is installed to include the risers, pumps, fire alarm panel, and other system components. Indicate if the system is installed under the raised floor, above the dropped ceiling or only at the rack level as applicable.
  • Specifically describe the operation of the pump panel.  Define how to run the pumps at no flow, minimum flow, rated flow, and peak flow conditions.
  • Indicate if the use of passwords is required.
  • Indicate the location of the Operations and Maintenance manuals for further reference.
  • Reference the Fire Alarm Panel and Detection standard operating procedure for the following:
     • Detection sequence leading to alarm and discharge
     • How an alarm or fault is indicated both at the panel and at the horns and strobes for normal, alarm and fault conditions.
     • Fire panel buttons, menus and diagnostic testing.
     • Silencing alarms, clearing/resetting the system.
     • Alarm notification sequence and expected response times. 
Outputs
       The data center manager shall ensure that a logbook is kept up to date with all inspection results and required actions.


Metrics

  • Fire sprinkler system remains 100% available.
  • No false alarms.

Friday, April 13, 2012

Data Center Access Policy and Guidelines

       This procedure provides the policy and guidelines for granting short-term (visitor) and long-term data center access. This will be achieved by a workforce (i.e., anyone conducting work, including but not limited to contractors, vendor partners, and employees) who consistently apply safe work practices, including emergency procedures in the event of disasters, to meet your business needs.  A safe and secure environment will help to ensure the health and well-being of all individuals (including workforce and visitors) as well as minimize the impact of incidents that could affect business operations.
      The goal is to mitigate potential and preventable impacts to facilities IT equipment caused by unauthorized data center access, through the application of industry standards, governmental regulations, and current best practices, ensuring 100% operability and reliability when called upon for service.
 
       Workforce members should assist each other in following the guidelines.


Input(s)

  • The Data Center Facilities Team Lead is responsible to ensure that all IT personnel adhere to the policy of applying for data center access.
  • See contact list for appropriate contacts.
  • Any member of the workforce may STOP any WORK in progress due to unsafe work methods, conditions causing the area to be unsafe, emergency, or for any other Data Center operational necessity.  All unsafe work methods will be reported to the Data Center Operations Team Lead, who will investigate the situation and take corrective action.
Physical Security
  • Physical access to all computer rooms must be tightly controlled.  Doors must be locked at all times with only authorized personnel having access.
     • All employees, contractors, and visitors on company premises must wear identification tags at all times.
     • All visitors and vendors MUST be approved by the Data Center Facilities authorized approvers, have visitor access added to their smart card (visitor card), and be escorted while in the data center.
     • Authorized personnel must not allow unknown or unauthorized individuals into restricted areas.  Unauthorized or unknown personnel not accompanied by authorized personnel, particularly in computer areas, must be challenged (in a tactful manner).  Personnel without a valid reason for being in the computer room must be escorted out of the computer room immediately, and Security must be contacted.
     • Security in the Data Centers is the responsibility of the Data Center Facilities Team Lead.  The Data Center Facilities Team Lead will manage security by:
        • Daily reviews and processing of your facilities Data Center Access
        • Weekly reviews - Door Checks
        • Quarterly review of all individuals with Data Center Access
        • Yearly Documentation and Review
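The daily, weekly, quarterly, and yearly review cadence above lends itself to a simple "what is due today" check. The following is a minimal sketch, not a prescribed tool: the review names, the `reviews_due` function, and the interval lengths (91 days for a quarter, 365 for a year) are assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumed interval lengths for each review in the list above.
REVIEW_INTERVALS = {
    "daily access review": timedelta(days=1),
    "weekly door check": timedelta(weeks=1),
    "quarterly access-holder review": timedelta(days=91),
    "yearly documentation review": timedelta(days=365),
}


def reviews_due(last_done: dict, today: date) -> list:
    """Return the reviews whose interval has elapsed (or that were never done)."""
    due = []
    for name, interval in REVIEW_INTERVALS.items():
        last = last_done.get(name)
        if last is None or today - last >= interval:
            due.append(name)
    return due


last_done = {
    "daily access review": date(2012, 4, 12),
    "weekly door check": date(2012, 4, 9),
    "quarterly access-holder review": date(2012, 1, 10),
    "yearly documentation review": date(2011, 6, 1),
}
print(reviews_due(last_done, date(2012, 4, 13)))
# ['daily access review', 'quarterly access-holder review']
```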
Outputs
        The Data Center Facilities Team Lead shall ensure that records and documentation are kept for all checklist results, door checks and required maintenance actions.


Metrics
       Checklist results are maintained and available at all times.


Contacts
  • Data Center Facilities Team Lead – xxx (to be assigned)
  • Data Center Facilities Coordinator – xxx (to be assigned)
  • Emergency Back Up – xxx (to be assigned)

Thursday, April 12, 2012

Best Practices of RAISED FLOOR

       The purpose of this article is to provide a basis of "Best Practices of RAISED FLOOR" for data center management and infrastructure. This can be applied to your data center. Ensuring that the raised floor is structurally sound, well grounded, and properly maintained contributes not only to the overall reliability of the data center but to safety as well.

Design

      The following should be considered and implemented in the design of the data center raised floor.
  • The raised floor grid should be grounded to the ground reference (meaning earth ground).
  • Use "white space" to spread out the equipment and prevent hot spots
  • A floor tile layout grid marked either on the walls or the tiles themselves will provide an easy reference to any equipment location. This grid layout can also be indicated on any columns
  • Any sub-floor infrastructure (valves, electrical panels, etc.) should be noted with signage. The same applies to any infrastructure in the ceiling that cannot be seen
  • Fiber cabling runs in the raised floor should be protected in either metal cable raceways or by some other  protection method
  • Servers should be elevated off of the floor so that air intakes do not become clogged with dust and dirt
  • In areas where earthquakes are prevalent, earthquake restraints should be installed on all equipment, storage cabinets and shelving
  • In earthquake prone areas, equipment should be installed on the Iso-Base product.
  • All equipment racks should be equipped with a power distribution unit. The use of power strips to connect to electrical circuits is viewed as a safety hazard.
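To illustrate the floor tile layout grid mentioned in the list above, the sketch below converts a position on the floor into a tile label such as "C6" (column letters, row number). The labeling scheme, the function names, and the 600 mm tile size are assumptions for the example; use whatever grid convention and tile size your site actually has.

```python
def column_letters(index: int) -> str:
    """Convert a zero-based column index to letters: 0 -> A, 25 -> Z, 26 -> AA."""
    if index < 0:
        raise ValueError("index cannot be negative")
    letters = ""
    n = index + 1
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def tile_label(x_mm: int, y_mm: int, tile_mm: int = 600) -> str:
    """Label the tile containing a point measured from the room's reference corner.

    Assumption: square tiles of tile_mm per side, rows numbered from 1.
    """
    if x_mm < 0 or y_mm < 0:
        raise ValueError("coordinates cannot be negative")
    column = x_mm // tile_mm
    row = y_mm // tile_mm + 1
    return f"{column_letters(column)}{row}"


print(tile_label(0, 0))        # A1
print(tile_label(1500, 3000))  # C6
```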

Operations
      The following should be considered and implemented in the operations of the data center raised floor.
  • A proactive maintenance program is critical to long-term effectiveness of any program designed for system availability.
  • A data center audit should be conducted on a regular basis by an outside, impartial firm
  • The raised floor and pedestals should be checked on a regular basis for leveling and fit. Warped, protruding or badly fitting tiles should be replaced.
  • The sub-floor area should be cleaned with a vacuum on a regular basis in order to prevent debris from being blown into the equipment racks.
  • Tiles with galvanized, non-painted bottom surfaces should be checked for zinc whiskers.  The zinc in galvanized tiles will come off in 2-micron whiskers. These may be blown into the equipment racks.
  • Unused data cables should be removed from the sub-floor. This is an NFPA code requirement.
  • Marking tiles or using contrasting color tiles to indicate where equipment can be installed versus reserved space for infrastructure or white space alleviates layout confusion
  • Use yellow hazard tape on the tiles to guide foot traffic away from critical or non workspace areas
  • Raised flooring that has electroplated passivated sheet metal bottoms with wood cores is subject to zinc whiskers. These are approximately 2 microns thick, and can be blown into equipment. Inspect the bottoms of the tiles with a flashlight at an oblique angle. If the surface twinkles, zinc whiskers are present.
  • At no time should wire spools with excess wire or cable be left under the floor. These contribute to poor air distribution
  • Equipment should be unpacked outside the data center to reduce the amount of debris and dirt circulating in the space
  • Storage of cardboard boxes in the data center should not be permitted. Doing so is a fire hazard and contributes to the amount of dust and fiber circulated into the equipment.
  • No food or drink should be allowed in the data centers at any time.

Infrastructure Equipment

       The purpose of this article is to provide a basis of "Best Practices of Infrastructure Equipment" for data center management and infrastructure. Local business reasons, local governmental code or other circumstances may limit the implementation of these best practices.


       It should be understood that implementing the "Best Practices of Infrastructure Equipment" may require long-term processes and capital or expense funding, and that the corrective actions will be ongoing. No work activities that have the potential to impact services can be performed without the prior approval of the appropriate management personnel.
  • Crankcase vapor recovery systems on the diesel generators help keep the fins clean and heat transfer efficient
  • Fuel filters on the diesel generators should have water detection sensors installed
  • Having multiple fuel and oil filters with bypass valving on the diesel generators allows filter replacement while the generators are running
  • Re-circulation dampers reduce CFM air flow during cold weather
  • The installation of an insulating jacket on the diesel generator exhaust pipes will help to reduce ambient temperatures inside enclosed engine rooms
  • Compressed air should be piped directly to the diesel generator locations, with the proper quick connect fittings, to provide easy connection points for impact tools
  • The use of aircraft-type hoses (metal braided exterior) instead of rubber hoses on diesel generators will lengthen hose life
  • Diesel generator crankcase heaters should have cutoff valves installed for easy hose replacement
  • Pre-filled oil filter cartridges and an electric lube oil transfer pump can reduce downtime for an oil change
  • Engine start batteries should have clear plastic covers installed to prevent accidents
  • Permanently installed load banks will allow UPS systems and generators to be tested on a regular basis. Both UPS systems and generators should be tested on a monthly basis
  • A log sheet for recording UPS maintenance should be readily available, and preferably attached to the main UPS panels
  • All circuit breakers should be marked with a color coded schema to indicate whether they are normally open or closed
  • Main UPS and switch gear circuit breakers should be protected from accidental operation
  • Spare fuses should be stocked near the points where they are used
  • Environmental equipment systems labeling should be thorough, consistent and clear
  • Acid absorbent materials should be installed below UPS batteries
  • The installation of a permanent chain hoist in the UPS battery room will facilitate battery replacement and reduce the chance of personnel injury
  • Individual cell equalizers installed on the UPS batteries will ensure that each cell gets the exact charging voltage required for optimal battery performance
  • The use of a battery watering cart, complete with a bulk supply of de-ionized water and a small pump, will reduce the time required for topping off batteries
  • If battery lugs have multiple bolt holes, multiple bolts should be used to prevent constraint of current flow
  • Eye wash stations in the battery rooms should be interconnected to the alarm system, with appropriate organizations notified when the station is used
  • All piping should be numbered, color coded and flow direction noted
  • External connection points on the chilled water piping would allow the use of truck mounted chiller units in case of emergency
  • Breathing apparatus should be located in the proper area for the number of personnel assigned to the area
  • The condenser and cooling tower water control valves should have a manual override with a valve position indicator clearly marked
  • Cooling tower sump water should be checked on a daily basis for clearness. The use of a portable sand filter for the cooling towers is advantageous
  • An active biocide program should be initiated to prevent growth in the condensate drains for the CRAHs
  • Spill containment equipment and programs should be instituted in all spaces
  • Posting any abnormal operating configurations or conditions on a status board available to all shifts allows each shift to be informed about them immediately
  • Each door exiting the equipment plants should have a telephone, flashlight, fire extinguisher and emergency procedures
  • Lockout and tagout tools should be clearly labeled and stored in such a manner that any missing tool is apparent
  • Spare parts cabinets should be well stocked. Parts inventory sheets direct mechanics and engineers to the appropriate spares quickly
  • Fire stopping between spaces should be consistent and well done
  • Fire detection heads should be on flexible conduit for easy relocation
  • Cooling condenser coils should be kept clean. A regular maintenance program should exist on the coils. Dirty coils reduce capacity
  • Cooling tower piping should be protected against freezing on cold days