IT Administrators

Monday, April 23, 2012

Solving the Problem

7:11 AM TIPS AND TRICKS, Troubleshooting No comments

Many device or network problems are straightforward to resolve, but others yield misleading symptoms. If one solution does not work, continue with another.
A solution often involves:

Upgrading software or hardware (for example, upgrading to a new version of agent software or installing Gigabit Ethernet devices)
Balancing your network load by analyzing:
• Which users communicate with which servers
• What the user traffic levels are in different segments

Based on these findings, you can decide how to redistribute network traffic.
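
As a rough illustration of that analysis, here is a small Python sketch that totals traffic per user/server pair and per segment. The record layout (user, server, segment, bytes) and the sample values are made up for this example; real figures would come from whatever your network management software can export.

```python
from collections import Counter

def summarize_flows(flows):
    """Total bytes per (user, server) pair and per segment.

    `flows` is an iterable of (user, server, segment, byte_count) tuples.
    This record layout is hypothetical, not the export format of any tool.
    """
    by_pair = Counter()
    by_segment = Counter()
    for user, server, segment, byte_count in flows:
        by_pair[(user, server)] += byte_count
        by_segment[segment] += byte_count
    return by_pair, by_segment

# Made-up sample records, for illustration only.
sample = [
    ("alice", "mail01", "seg-A", 1200),
    ("alice", "file01", "seg-A", 5000),
    ("bob", "file01", "seg-B", 700),
    ("alice", "file01", "seg-A", 3000),
]
pairs, segments = summarize_flows(sample)
busiest_segment = segments.most_common(1)[0][0]
```

A segment that dominates the totals is a candidate for redistribution, for example by moving a heavily used server closer to its main users.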

Adding segments to your LAN (for example, adding a new switch where utilization is continually high)
Replacing faulty equipment (for example, replacing a module that has port problems or replacing a network card that has a faulty jabber protection mechanism)

To help solve problems, have available:

Spare hardware equipment (such as modules and power supplies), especially for your critical devices
A recent backup of your device configurations to reload if flash memory gets corrupted (which can sometimes happen due to a power outage)

Why do we investigate incidents? The key purpose of an investigation should be to:

- prevent a future recurrence of the incident

- determine the root cause to prevent similar losses at the same or another location

- satisfy legal and company requirements and determine the company's liability

- benefit from lessons learned, which may result in improved safety and operation

- keep employees informed about the event and follow-up action

Identifying and Testing the Cause of the Problem

5:40 AM TIPS AND TRICKS, Troubleshooting No comments

After you develop a theory about the cause of the problem, test your theory. The test must conclusively prove or disprove your theory.

Two general rules of troubleshooting are:

If you cannot reproduce a problem, then no problem exists unless it happens again on its own.
If the problem is intermittent and you cannot replicate it, you can configure your network management software to catch the event in progress.

      For example, with "LANsentry Manager", you can set alarms and automatic packet capture filters to monitor your network and inform you when the problem occurs again. See "Configuring Transcend NCS" for more information.

      Although network management tools can provide a great deal of information about problems and their general location, you may still need to swap equipment or replace components of your network until you locate the exact trouble spot.

      After you test your theory, either fix the problem as described in "Solving the Problem" or develop another theory.

Sample Problem Analysis
       This section illustrates the analysis phase of a typical troubleshooting incident. On your network, a user cannot access the mail server. You need to establish two areas of information:

What you know - In this case, the user's workstation cannot communicate with the mail server.
What you do not know and need to test -
• Can the workstation communicate with the network at all, or is the problem limited to communication with the server? Test by sending a "Ping" or by connecting to other devices.
• Is the workstation the only device that is unable to communicate with the server, or do other workstations have the same problem? Test connectivity at other workstations.
• If other workstations cannot communicate with the server, can they communicate with other network devices? Again, test the connectivity.
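
If you want to script the "connecting to other devices" test, a minimal Python sketch could look like the one below. It uses a plain TCP connection attempt rather than an ICMP ping (sending ICMP from Python normally needs elevated privileges), so treat it as a stand-in for ping, not a replacement. The function name is my own.

```python
import socket

def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, calling can_connect with the mail server's host name and its SMTP port from the affected workstation, and then from a neighbouring workstation, answers the first two questions above. The host and port to use are whatever applies at your site.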

The analysis process follows these steps:

1. Can the workstation communicate with any other device on the subnetwork?
• If no, then go to step 2.
• If yes, determine if only the server is unreachable.
• If only the server cannot be reached, this suggests a server problem. Confirm by doing step 2.
• If other devices cannot be reached, this suggests a connectivity problem in the network. Confirm by doing step 3.
2. Can other workstations communicate with the server?
• If no, then most likely it is a server problem. Go to step 3.
• If yes, then the problem is that the workstation is not communicating with the subnetwork. (This situation can be caused by workstation issues or a network issue with that specific station.)
3. Can other workstations communicate with other network devices?
• If no, then the problem is likely a network problem.
• If yes, the problem is likely a server problem.
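
The outcome of steps 2 and 3 can be written down as a tiny decision function. This is only one simplified reading of the steps above: step 1 (the affected workstation's own connectivity) is what prompts you to run the other two checks, so it is not a parameter here. The function and parameter names are my own.

```python
def locate_fault(others_reach_server, others_reach_devices):
    """Classify the fault as 'workstation', 'server', or 'network'.

    others_reach_server  -- step 2: can other workstations reach the server?
    others_reach_devices -- step 3: can other workstations reach other devices?
    """
    if others_reach_server:
        # Everyone else is fine, so the fault is local to this workstation
        # or to its connection to the subnetwork.
        return "workstation"
    if others_reach_devices:
        # The network carries other traffic; only the server is unreachable.
        return "server"
    # Other workstations cannot reach other devices either.
    return "network"
```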

When you determine whether the problem is with the server, subnetwork, or workstation, you can further analyze the problem, as follows:

For a problem with the server - Examine whether the server is running, if it is properly connected to the network, and if it is configured appropriately.
For a problem with the subnetwork - Examine any device on the path between the users and the server.
For a problem with the workstation - Examine whether the workstation can access other network resources and if it is configured to communicate with that particular server.

Equipment for Testing
To help identify and test the cause of problems, have available:

A laptop computer that is loaded with a terminal emulator, TCP/IP stack, TFTP server, CD-ROM drive (to read the online documentation), and some key network management applications, such as LANsentry Manager. With the laptop computer, you can plug into any subnetwork to gather and analyze data about the segment.
A spare managed hub to swap for any hub that does not have management. Swapping in a managed hub allows you to quickly spot which port is generating the errors.
A single port probe to insert in the network if you are having a problem where you do not have management capability.
Console cables for each type of connector, labeled and stored in a secure place.

Understanding the Problem

5:31 AM TIPS AND TRICKS, Troubleshooting No comments

Networks are designed to move data from a transmitting device to a receiving device. When communication becomes problematic, you must determine why data are not traveling as expected and then find a solution. The two most common causes for data not moving reliably from source to destination are:

The physical connection breaks (that is, a cable is unplugged or broken).
A network device is not working properly and cannot send or receive some or all data.

Network management software can easily locate and report a physical connection break (layer 1 problem). It is more difficult to determine why a network device is not working as expected, which is often related to a layer 2 or a layer 3 problem.

To determine why a network device is not working properly, look first for:

Valid service - Is the device configured properly for the type of service it is supposed to provide? For example, has Quality of Service (QoS), which is the definition of the transmission parameters, been established?
Restricted access - Is an end station supposed to be able to connect with a specific device or is that connection restricted? For example, is a firewall set up that prevents that device from accessing certain network resources?
Correct configuration - Is there a misconfiguration of IP address, subnet mask, gateway, or broadcast address? Network problems are commonly caused by misconfiguration of newly connected or configured devices.
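
The "correct configuration" check lends itself to a quick script. The sketch below uses Python's standard ipaddress module to test whether an IPv4 address, subnet mask, gateway, and broadcast address are consistent with each other. The function name and the sample addresses in the usage note are my own, and it only covers ordinary subnets (not point-to-point /31 or /32 cases).

```python
import ipaddress

def check_ip_config(address, netmask, gateway, broadcast):
    """Return a list of problems found in an IPv4 configuration."""
    problems = []
    iface = ipaddress.ip_interface(f"{address}/{netmask}")
    network = iface.network
    if iface.ip in (network.network_address, network.broadcast_address):
        problems.append("address is the network or broadcast address")
    if ipaddress.ip_address(gateway) not in network:
        problems.append("gateway is outside the local subnet")
    if ipaddress.ip_address(broadcast) != network.broadcast_address:
        problems.append("broadcast address does not match the subnet mask")
    return problems
```

For instance, 192.168.1.10 with mask 255.255.255.0, gateway 192.168.1.1, and broadcast 192.168.1.255 returns an empty list, while changing the gateway to 192.168.2.1 reports that the gateway is outside the local subnet.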

Recognizing Symptoms

2:24 AM TIPS AND TRICKS, Troubleshooting No comments

The first step to resolving any problem is to identify and interpret the symptoms. You may discover network problems in several ways. Users may complain that the network seems slow or that they cannot connect to a server. You may pass your network management station and notice that a node icon is red. Your beeper may go off and display the message: WAN connection down.

User Comments
Although you can often solve networking problems before users notice a change in their environment, you invariably get feedback from your users about how the network is running, such as:

They cannot print.

They cannot access the application server.

It takes them much longer to copy files across the network than it usually does.

They cannot log on to a remote server.

When they send e-mail to another site, they get a routing error message.

Their system freezes whenever they try to Telnet.

Network Management Software Alerts
Network management software, as described in "Your Network Troubleshooting Toolbox", can alert you to areas of your network that need attention. For example:

The application displays red (Warning) icons.

Your weekly Top-N utilization report (which indicates the 10 ports with the highest utilization rates) shows that one port is experiencing much higher utilization levels than normal.

You receive an e-mail message from your network management station that the threshold for broadcast and multicast packets has been exceeded.

These signs usually provide additional information about the problem, allowing you to focus on the right area.
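
To make the Top-N idea concrete, here is a small Python sketch that picks the busiest ports from a set of utilization figures and flags the ones running well above their usual level. The dictionary layout and the "twice the baseline" factor are assumptions for the example, not settings from any particular management product.

```python
import heapq

def top_n_ports(utilization, n=10):
    """Return the n (port, percent) pairs with the highest utilization."""
    return heapq.nlargest(n, utilization.items(), key=lambda item: item[1])

def above_baseline(utilization, baseline, factor=2.0):
    """Ports whose utilization is at least `factor` times their baseline.

    `factor` is an arbitrary illustrative threshold.
    """
    return sorted(
        port for port, pct in utilization.items()
        if port in baseline and pct >= factor * baseline[port]
    )

# Made-up figures (percent utilization per port).
this_week = {"1/1": 12.0, "1/2": 85.0, "1/3": 30.0}
normal = {"1/1": 10.0, "1/2": 20.0, "1/3": 28.0}
busiest = top_n_ports(this_week, n=2)
unusual = above_baseline(this_week, normal)
```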

Analyzing Symptoms

When a symptom occurs, ask yourself these types of questions to narrow the location of the problem and to get more data for analysis:

To what degree is the network not acting normally (for example, does it now take one minute to perform a task that normally takes five seconds)?

On what subnetwork is the user located?

Is the user trying to reach a server, end station, or printer on the same subnetwork or on a different subnetwork?

Are many users complaining that the network is operating slowly or that a specific network application is operating slowly?

Are many users reporting network logon failures?

Are the problems intermittent? For example, some files may print with no problems, while other printing attempts generate error messages, make users lose their connections, and cause systems to freeze.

Troubleshooting Strategy

2:15 AM TIPS AND TRICKS, Troubleshooting No comments

How do you know when you are having a network problem? The answer to this question depends on your site's network configuration and on your network's normal behavior. See "Knowing Your Network" for more information.
If you notice changes on your network, ask the following questions:

Is the change expected or unusual?
Has this event ever occurred before?
Does the change involve a device or network path for which you already have a backup solution in place?
Does the change interfere with vital network operations?
Does the change affect one or many devices or network paths?

After you have an idea of how the change is affecting your network, you can categorize it as critical or noncritical. Both of these categories need resolution (except for changes that are one-time occurrences); the difference between the categories is the time that you have to fix the problem.

By using a strategy for network troubleshooting, you can approach a problem methodically and resolve it with minimal disruption to network users. It is also important to have an accurate and detailed map of your current network environment. Beyond that, a good approach to problem resolution is:

Identifying and Testing the Cause of the Problem

Solving the Problem

Best Practices for Change Management Process

5:25 PM TIPS AND TRICKS No comments

This post outlines best practices for the change management process. Change requests for Mission Critical or Significant applications, systems that contain or access High Integrity or Very High Integrity data, systems that contain or access Classified or Confidential-Restricted Access information, and infrastructure components that support these applications and systems must be documented via a change request form approved by the Reporting Unit. That change request form requires approval by the Information Steward or Delegate. The change request form must, at a minimum, contain the following information:

Who is initiating the change
Who is responsible for implementing the change
Who is responsible for the approval
Business justification for the change
Nature of defect (if applicable)
Testing required and who is responsible for the testing
Back-out procedures
Systems impacted
User contact
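
If your change request form is kept electronically, the minimum-content rule above is easy to check automatically. The Python sketch below is one possible way to do it; the field names are my own shorthand for the items in the list, and "nature of defect" is treated as optional because the list marks it "if applicable".

```python
# Field names are hypothetical shorthand for the items listed above.
REQUIRED_FIELDS = [
    "initiator",
    "implementer",
    "approver",
    "business_justification",
    "testing_required",
    "tester",
    "back_out_procedures",
    "systems_impacted",
    "user_contact",
]
OPTIONAL_FIELDS = ["nature_of_defect"]  # only "if applicable"

def missing_fields(form):
    """Return the required fields that are absent or blank in `form` (a dict)."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = form.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing
```

A request would only go forward to the Information Steward when missing_fields returns an empty list.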

Applies to

Information Steward : approver
Systems Administrator : logger
Developer : programmer
Publisher : publish to production environment
Users : User Acceptance testing
IP Coordinator : communicates the process to regional/local IP communities

How To Prioritization for Incidents

5:51 AM Troubleshooting No comments

What is an incident? An incident is a system bug or error, user question, or routine administration request.
Defect Categories Defined –

High

     Incident of highest relative urgency. Essential Suite may be severely impacted and end-users require immediate assistance. The situation meets one or more of the following criteria:
      1. Any issue that significantly increases the likelihood of a safety or environmental incident occurring and/or the consequence of that potential event
      2. A Mission Critical business process is impacted and no workaround exists.
      3. Impacts 100 users or more.
      4. Work is totally stopped.
      5. System is down completely.

Medium

     Significant problem for the end-user, may result in financial or other serious impact for Essential Suite. Situation may become of high priority if not quickly addressed. The situation is not high, but meets one or more of the following criteria:
      1. A significant business process is impacted but a workaround exists.
      2. Impacts 50 to 99 users.
      3. Significant loss of work capacity, but can get some work done.

Incident Classification
We classify incidents based on the scenarios defined below:

High – System down related issues
Medium – User has classified it as moderate priority based on criteria, access related issue, etc.
Low – Updating records in system, Scheduling report, Data Mining, Close action items issue, Troubleshooting issues

Times:	High	Medium	Low
Initial Response Time	<= 2 Hours	<= 24 Hours	<= 2 Business Days
Restoration Time for an incident	<= 24 Hours	<= 2 Business Days	<= 5 Business Days
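
The criteria and target times above can be sketched as a small Python helper. This is only an illustration: the class and field names are mine, and because the post does not list criteria for Low, anything that matches neither High nor Medium falls through to Low here, which is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    users_impacted: int = 0
    safety_or_environmental_risk: bool = False
    mission_critical_impacted: bool = False
    significant_process_impacted: bool = False
    workaround_exists: bool = False
    work_stopped: bool = False
    system_down: bool = False
    partial_capacity_loss: bool = False

def priority(inc):
    """Apply the High criteria first, then Medium; everything else is Low."""
    if (inc.safety_or_environmental_risk
            or (inc.mission_critical_impacted and not inc.workaround_exists)
            or inc.users_impacted >= 100
            or inc.work_stopped
            or inc.system_down):
        return "High"
    if ((inc.significant_process_impacted and inc.workaround_exists)
            or 50 <= inc.users_impacted <= 99
            or inc.partial_capacity_loss):
        return "Medium"
    return "Low"  # assumption: the post gives no explicit Low criteria

# (initial response, restoration) targets copied from the table above.
TARGETS = {
    "High": ("2 hours", "24 hours"),
    "Medium": ("24 hours", "2 business days"),
    "Low": ("2 business days", "5 business days"),
}
```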

Best Practice to contact an end user

9:04 AM TIPS AND TRICKS No comments

This is a best practice for contacting an end user when a trouble ticket is open and assigned to the analyst. An open ticket suggests an obligation on the part of the customer to be available to troubleshoot the issue. We must give the customer every opportunity to be available for that troubleshooting, within reason.

Tips and Pointers

Try multiple methods of contact to ensure that every effort has been made to connect with the customer.
You can include the final closing email as a saved .msg file in Attachments if you wish to preserve formatting.
Try to keep contact efforts to 24-hour increments so that the ticket does not linger overly long.
Be polite and thorough in your communications and documentation.
Just a reminder: management is collecting metrics on the length of time that tickets remain open, especially tickets with no activity, so keep your documentation up to date.
DOCUMENT ALL ATTEMPTS!

Example Email Template
Template 1
Subject: PC Service Request <<case#>> Action Required
Hello,
I am with <<location>> IT End User Support and have received your PC Service Request, case # <<case#>>. I would like to help you with <<Short description of PC problem>>, but have not been able to get in touch with you by phone or email. Please let me know your availability at your earliest convenience.
Thank you,
<<Analysts Name>>

Template 2
Subject: 2nd Attempt. PC Service Request <<case#>> Action Required
Hello,
I am with <<location>> IT End User Support and have received your PC Service Request, case # <<case#>>. I would like to help you with <<Short description of PC problem>>, but have not been able to get in touch with you by phone or email. Please let me know your availability at your earliest convenience.
Thank you,
<<Analysts Name>>

Template 3
Subject: 3rd Attempt. PC Service Request <<case#>> Action Required
Hello,
I am with <<location>> IT End User Support and have received your PC Service Request, case # <<case#>>. I would like to help you with <<Short description of PC problem>>, but have not been able to get in touch with you by phone or email.
This is our third attempt to contact you. I will close this case unless I hear from you by end of business today. If you still need assistance with the same issue, please call the Help Desk at 000 000 (or 000-000-0000 after hours) and have your current ticket reopened within 20 days.
Thank you,
<<Analyst name>>

**please replace << >> with the specific information mentioned within brackets
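
If you send these emails often, the << >> substitution can be scripted. Below is a minimal Python sketch; the function names and the sample case number are made up, and the subject wording simply follows the three templates above.

```python
import re

def fill_template(template, values):
    """Replace each <<name>> placeholder with values[name].

    Raises KeyError if a placeholder has no value, so a half-filled
    email is not produced by accident.
    """
    return re.sub(
        r"<<(.+?)>>",
        lambda match: str(values[match.group(1).strip()]),
        template,
    )

def subject_line(case_number, attempt=1):
    """Build the subject for contact attempt 1, 2, or 3."""
    prefix = {1: "", 2: "2nd Attempt. ", 3: "3rd Attempt. "}[attempt]
    base = "PC Service Request <<case#>> Action Required"
    return prefix + fill_template(base, {"case#": case_number})
```

The same fill_template call works for the message body, given a dictionary with an entry for every placeholder that appears in it.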

Network Recovery Strategy ~ BCP

5:15 AM Servers, TIPS AND TRICKS No comments

This post shows an example network recovery strategy (the network part of a BCP) that can be applied to your business. The network and operations facilities at the data center provide business application systems for your business and include:

Core Network Services: (Exchange Email, File and Print Services, Internet / Intranet)
Business Applications: (ERP, Financials, A/R, A/P, AM, billing), Mainframe printing.
User Workstations and associated application software for approximately 150 users in Marketing, Finance, Staff / IT etc.

Disaster Classification:

Level 1 – Temporary (less than 7 days) Loss of power / water to your building. This would require the shutdown of the computer room and servers but loss of equipment or data would be minimal.
Level 2 – Significant (greater than 7 days) – building cannot be occupied (fire/water damage, disease, other threat) but city infrastructure is intact.
Level 3 – Significant widespread damage to the city infrastructure (earthquake). Many core services are unavailable; employees are unable to report to work etc.

The IT recovery strategy is primarily designed to respond to level 1 or level 2 disasters. Level 3 disasters are within the scope of your business resumption plan where the primary focus is on ensuring the safety and security of employees and company assets and providing disaster assistance to the community.
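
As a quick illustration of the classification, the three levels can be expressed as a short Python function. The parameter names are mine, and since the text says "less than 7 days" and "greater than 7 days" without covering exactly 7 days, this sketch treats 7 days as Level 1, which is an assumption.

```python
def disaster_level(outage_days, building_usable=True, city_infrastructure_intact=True):
    """Return 1, 2, or 3 following the disaster classification above."""
    if not city_infrastructure_intact:
        return 3
    if not building_usable or outage_days > 7:
        return 2
    return 1  # assumption: an outage of exactly 7 days counts as Level 1

def covered_by_it_recovery(level):
    """The IT recovery strategy targets Level 1 and Level 2 disasters."""
    return level in (1, 2)
```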

Key Assumptions Example:

The primary recovery site for your site is the …
The backup site facilities have the minimum network and hardware components required to establish basic network operations. The alternate site emergency response facility and equipment (laptops/printers) are available for use.
A portion of the backup office facilities and equipment (workstations / printers) are available for your users.
The recovery of business application systems (ERP etc.) would require sourcing appropriate hardware (via hardware vendor)
All recovery documentation and required backup tapes are available offsite.
Current IT staff is available to perform recovery processes. Additional resources are available from your company's other locations.

Recovery Phases:

Phase 0 Day 1
     Disaster
     Ensure safety of employees
     Notification / communication / Formal Disaster Declaration
     Assembly of recovery team / roles
     Assessment of Impact, stability of recovery facility, and recovery timeframe
     Determine Recovery Strategy
     Order recovery tapes

Phase 1 Day 2-4
     Recover Core Network Components:
        • Exchange Server
        • File Servers
        • WAN connectivity

Phase 2 Day 5 - 10
     Recover Core Business Applications:
        • ERP
        • Mainframe Printing

Phase 3 Day 10 - 30
Complete Recovery of All Systems or reactivation of data center

Notification and Declaration of a Disaster
The first and foremost objective when a disaster happens is to ensure the safety of all staff; this takes precedence over any recovery activities.
During a recovery process, recovery personnel must take appropriate and adequate rest breaks and use safety controls to ensure their personal safety. The maximum length of a recovery shift is 12 hours and includes periodic rest breaks.

The primary responsibility for declaring an IT/Network disaster and invoking the disaster recovery plan rests with the Manager of Information Technology. Secondary responsibility rests with the Network Team Lead and Office and Information Services Team Leads in consultation with your company Leadership Team. Specific responsibilities are:

Communicate disaster to your company Leadership team – what happened, why, when, initial assessment and recovery overview.
Identify and contact the IT recovery teams, recovery team leads as well as a recovery coordinator. The recovery teams will be created from the existing IT organization based on who is available. For a disaster requiring recovery to an alternate site, or where the recovery time is likely to exceed 12 hours, at least 2 teams should be created.
Facilitate the assessment of the disaster and development of a recovery plan.
Ongoing communication to your company Leadership team, management and employees as appropriate.

GUIDELINES FOR SITES WITH < 512K AVAILABLE DATA-ONLY TO DC PROMO

6:12 AM Servers, TIPS AND TRICKS No comments

I would like to share guidelines for sites with <512K available data-only connections that propose to DC promo their local DCs. For sites that plan to install and promote (locally) an AD domain controller, a 512K available data-only connection is the strong recommendation, and it is the only connection I am quite confident will succeed without requiring additional support and effort. If the site has a link lower than that, additional research needs to be done to ensure a smooth promotion and to minimize adverse business impact.

My recommendation, and the one we feel comfortable with, is 512K data-only. Anything under that we may be able to try but, unfortunately, we cannot guarantee anything, so it is up to the site to determine whether it wants to take on that risk. For example, if a site's circuit is going to be upgraded later anyway, the site may want to wait, unless there is a business case that indicates it cannot wait.

Considerations:
Those at the site representing the business must understand and agree to the additional risk, should they elect to promote a DC over a <512K available data-only link, which includes:

Very slow connections to external resources, including applications, the Internet, etc., for a week or more. They should expect 3 weeks.
A successful promotion at a similar site does not ensure that other similar sites will be successful, because every site is different and the databases grow daily.
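
To get a feel for why a slower link stretches things out, here is a back-of-the-envelope Python sketch. It reads "512K" as 512 kilobits per second, which is my assumption, and it ignores compression, protocol overhead, and competing traffic, so the numbers are a rough lower bound at best. The payload size is a made-up input, not a figure for any real directory.

```python
RECOMMENDED_KBPS = 512  # the 512K figure, read here as kilobits per second

def meets_recommendation(available_kbps):
    """True if the available data-only bandwidth is at least 512K."""
    return available_kbps >= RECOMMENDED_KBPS

def transfer_hours(payload_mb, available_kbps):
    """Ideal time, in hours, to move payload_mb over the link."""
    if available_kbps <= 0:
        raise ValueError("available_kbps must be positive")
    bits = payload_mb * 8_000_000           # 1 MB taken as 10**6 bytes
    seconds = bits / (available_kbps * 1000)
    return seconds / 3600
```

With a hypothetical 1000 MB to move, a fully available 512K link needs a little over 4 hours in the ideal case, and halving the available bandwidth doubles that before any real-world overhead is counted.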

Here are the rules:
If a site MUST DC promo over a shared link or a link smaller than 512K available data-only ("available" meaning the part of the link that is not dedicated to some other data stream such as mail or ERP), the Server Infrastructure Team needs to talk to them to gather important information.

It is critical that the design teams understand what is going over the connection, so we can make an informed decision.
When promotions fail, it causes rework and could delay other sites. Also, the rework always introduces some small amount of risk that something inadvertently corrupts the rest of the forest.
Once we all understand what is going over the <512K link, the customer agrees to the risk, and the Server Infrastructure Team design team feels it will not adversely impact others, we can OK the attempt at a DC promotion. Again, the site will be expected to significantly reduce any other traffic going over the link during the promotion, and also during the SMS build. Traffic that should be taken into consideration and significantly curtailed includes:
      • voice traffic
      • external application traffic (like intranet, ERP)
      • Internet traffic
Promotions must start on the Friday just before a weekend to ensure the best throughput.
Whenever possible, promote the DC at a well-connected site and then ship it to the other site.
We can't make a lot of special allowances trying to make it work. If it promotes, it promotes, and if it doesn't, it doesn't. In many cases we can try it if the customer is willing to take on the risk. Once all servers come up on site, we cannot be certain there will be no performance degradation when servers try to sync, etc. Again, there are too many variables.

We need to address and agree to the plan for <512K available data-only sites well in advance of their planned DC promotions. That way, things can run as smoothly as possible.

Blog Archive

Monday, April 23, 2012

Solving the Problem

Sunday, April 22, 2012

Identifying and Testing the Cause of the Problem

Understanding the Problem

Saturday, April 21, 2012

Recognizing Symptoms

Troubleshooting Strategy

Friday, April 20, 2012

Best Practices for Change Management Process

Thursday, April 19, 2012

How To Prioritization for Incidents

Wednesday, April 18, 2012

Best Practice to contact an end user

Monday, April 16, 2012

Network Recovery Strategy ~ BCP

Sunday, April 15, 2012

GUIDELINES FOR SITES WITH < 512K AVAILABLE DATA-ONLY TO DC PROMO