IT Administrators

Saturday, April 21, 2012

Recognizing Symptoms

2:24 AM TIPS AND TRICKS, Troubleshooting

The first step to resolving any problem is to identify and interpret the symptoms. You may discover network problems in several ways. Users may complain that the network seems slow or that they cannot connect to a server. You may pass your network management station and notice that a node icon is red. Your beeper may go off and display the message: WAN connection down.

User Comments
Although you can often solve networking problems before users notice a change in their environment, you invariably get feedback from your users about how the network is running, such as:

They cannot print.

They cannot access the application server.

It takes them much longer to copy files across the network than it usually does.

They cannot log on to a remote server.

When they send e-mail to another site, they get a routing error message.

Their system freezes whenever they try to Telnet.

Network Management Software Alerts
Network management software, as described in "Your Network Troubleshooting Toolbox", can alert you to areas of your network that need attention. For example:

The application displays red (Warning) icons.

Your weekly Top-N utilization report (which indicates the 10 ports with the highest utilization rates) shows that one port is experiencing much higher utilization levels than normal.

You receive an e-mail message from your network management station that the threshold for broadcast and multicast packets has been exceeded.

These signs usually provide additional information about the problem, allowing you to focus on the right area.
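As a rough illustration of the Top-N utilization report mentioned above, the following Python sketch ranks ports by utilization and flags ports running well above their usual level. The port names, the sample figures, and the 1.5x threshold are hypothetical values chosen for the example; they do not come from any particular network management product.

```python
def top_n_ports(utilization, n=10):
    """Return the n (port, percent) pairs with the highest utilization."""
    ranked = sorted(utilization.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


def unusual_ports(current, baseline, factor=1.5):
    """Return ports whose current utilization exceeds factor x their baseline."""
    return [
        port
        for port, percent in current.items()
        if port in baseline and percent > factor * baseline[port]
    ]


# Hypothetical weekly averages, in percent of link capacity.
baseline = {"port1": 20.0, "port2": 35.0, "port3": 10.0}
current = {"port1": 22.0, "port2": 80.0, "port3": 11.0}

print(top_n_ports(current, n=2))
print(unusual_ports(current, baseline))
```

A report like this only tells you where to look; what counts as "normal" utilization depends on your own network's baseline.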

Analyzing Symptoms

When a symptom occurs, ask yourself these types of questions to narrow the location of the problem and to get more data for analysis:

To what degree is the network not acting normally (for example, does it now take one minute to perform a task that normally takes five seconds)?

On what subnetwork is the user located?

Is the user trying to reach a server, end station, or printer on the same subnetwork or on a different subnetwork?

Are many users complaining that the network is operating slowly or that a specific network application is operating slowly?

Are many users reporting network logon failures?

Are the problems intermittent? For example, some files may print with no problems, while other printing attempts generate error messages, make users lose their connections, and cause systems to freeze.

Troubleshooting Strategy

2:15 AM TIPS AND TRICKS, Troubleshooting

How do you know when you are having a network problem? The answer to this question depends on your site's network configuration and on your network's normal behavior. See "Knowing Your Network" for more information.
If you notice changes on your network, ask the following questions:

Is the change expected or unusual?
Has this event ever occurred before?
Does the change involve a device or network path for which you already have a backup solution in place?
Does the change interfere with vital network operations?
Does the change affect one or many devices or network paths?

After you have an idea of how the change is affecting your network, you can categorize it as critical or noncritical. Both of these categories need resolution (except for changes that are one-time occurrences); the difference between the categories is the time that you have to fix the problem.

By using a strategy for network troubleshooting, you can approach a problem methodically and resolve it with minimal disruption to network users. It is also important to have an accurate and detailed map of your current network environment. Beyond that, a good approach to problem resolution is:

Identifying and Testing the Cause of the Problem

Solving the Problem

Best Practices for Change Management Process

5:25 PM TIPS AND TRICKS

This post outlines best practices for the change management process. Change requests must be documented on a change request form approved by the Reporting Unit when they affect any of the following: Mission Critical or Significant applications; systems that contain or access High Integrity or Very High Integrity data; systems that contain or access Classified or Confidential-Restricted Access information; and the infrastructure components that support these applications and systems. The change request form requires approval by the Information Steward or Delegate. At a minimum, the form must contain the following information:

Who is initiating the change
Who is responsible for implementing the change
Who is responsible for the approval
Business justification for the change
Nature of defect (if applicable)
Testing required and who is responsible for the testing
Back-out procedures
Systems impacted
User contact
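As a minimal sketch of how the checklist above could be enforced before a form is sent for approval, the following Python function reports which required entries are still empty. The field names are hypothetical identifiers invented for this example, and "nature of defect" is treated as optional because the list marks it "if applicable".

```python
# Hypothetical field names that mirror the checklist above.
REQUIRED_FIELDS = [
    "initiator",
    "implementer",
    "approver",
    "business_justification",
    "testing_required",
    "tester",
    "back_out_procedures",
    "systems_impacted",
    "user_contact",
]


def missing_fields(change_request):
    """Return the required fields that are absent or blank in the request."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(change_request.get(field, "")).strip()
    ]


# Example request with two entries left blank.
request = {
    "initiator": "J. Smith",
    "implementer": "Server team",
    "approver": "Information Steward",
    "business_justification": "Fix report defect",
    "nature_of_defect": "Totals are wrong",
    "testing_required": "Regression test of reports",
    "tester": "",
    "back_out_procedures": "Restore previous release",
    "systems_impacted": "Reporting server",
}

print(missing_fields(request))
```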

Applies to

Information Steward : approver
Systems Administrator : logger
Developer : programmer
Publisher : publish to production environment
Users : User Acceptance testing
IP Coordinator : communicates the process to regional/local IP communities

How To Prioritization for Incidents

5:51 AM Troubleshooting

What is an incident? An incident is a system bug or error, user question, or routine administration request.
Defect categories defined:

High      Incident of highest relative urgency. Essential Suite may be severely impacted and end users require immediate assistance. The situation meets one or more of the following criteria:
      1. Any issue that significantly increases the likelihood of a safety or environmental incident occurring and/or the consequence of that potential event.
      2. A Mission Critical business process is impacted and no workaround exists.
      3. Impacts 100 users or more.
      4. Work is totally stopped.
      5. System is down completely.

Medium    Significant problem for the end user that may result in financial or other serious impact for Essential Suite. The situation may become high priority if not quickly addressed. The situation is not high, but meets one or more of the following criteria:
      1. A significant business process is impacted but a workaround exists.
      2. Impacts 50 to 99 users.
      3. Significant loss of work capacity, but can get some work done.

Incident Classification
We classify incidents based on the scenarios defined below:

High – System down related issues
Medium – User has classified it as moderate priority based on criteria, access related issue, etc.
Low – Updating records in system, Scheduling report, Data Mining, Close action items issue, Troubleshooting issues

Times:	High	Medium	Low
Initial Response Time	<= 2 Hours	<= 24 Hours	<= 2 Business Days
Restoration Time for an incident	<= 24 Hours	<= 2 Business Days	<= 5 Business Days
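The criteria and target times above can be expressed as a small decision function. The Python sketch below is one possible reading, not an official implementation: the `Incident` fields are hypothetical names for the listed criteria, and anything that meets neither the High nor the Medium criteria is assumed to be Low.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical flags, one per criterion in the lists above."""
    users_affected: int = 0
    safety_risk: bool = False
    mission_critical_no_workaround: bool = False
    work_stopped: bool = False
    system_down: bool = False
    significant_process_with_workaround: bool = False
    significant_capacity_loss: bool = False


# (initial response time, restoration time) taken from the table above.
TARGET_TIMES = {
    "High": ("<= 2 hours", "<= 24 hours"),
    "Medium": ("<= 24 hours", "<= 2 business days"),
    "Low": ("<= 2 business days", "<= 5 business days"),
}


def priority(incident):
    """Return "High", "Medium", or "Low" for the given incident."""
    if (
        incident.safety_risk
        or incident.mission_critical_no_workaround
        or incident.users_affected >= 100
        or incident.work_stopped
        or incident.system_down
    ):
        return "High"
    if (
        incident.significant_process_with_workaround
        or 50 <= incident.users_affected <= 99
        or incident.significant_capacity_loss
    ):
        return "Medium"
    # Assumption: everything else is treated as Low.
    return "Low"


level = priority(Incident(users_affected=60))
print(level, TARGET_TIMES[level])
```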

Best Practice to contact an end user

9:04 AM TIPS AND TRICKS

This post describes a best practice for contacting an end user when a trouble ticket is open and assigned to an analyst. An open ticket suggests an obligation on the part of the customer to be available to troubleshoot the issue. We must give the customer every opportunity to be available for that troubleshooting, within reason.

Tips and Pointers

Try multiple methods of contact to ensure that every effort has been made to connect with the customer.
You can include the final closing email as a saved .msg file in Attachments if you wish to preserve formatting.
Try to keep contact efforts to 24-hour increments so that the ticket does not linger overly long.
Be polite and thorough in your communications and documentation.
Remember that management is collecting metrics on the length of time that tickets remain open, especially tickets with no activity, so keep your documentation up to date.
DOCUMENT ALL ATTEMPTS!

Example Email Template
Template 1
Subject: PC Service Request <<case#>> Action Required
Hello,
I am with <<location>> IT End User Support and have received your PC Service Request, case # <<case#>>. I would like to help you with <<Short description of PC problem>>, but have not been able to get in touch with you by phone or email. Please let me know your availability at your earliest convenience.
Thank you,
<<Analysts Name>>

Template 2
Subject: 2nd Attempt. PC Service Request <<case#>> Action Required
Hello,
I am with <<location>> IT End User Support and have received your PC Service Request, case # <<case#>>. I would like to help you with <<Short description of PC problem>>, but have not been able to get in touch with you by phone or email. Please let me know your availability at your earliest convenience.
Thank you,
<<Analysts Name>>

Template 3
Subject: 3rd Attempt. PC Service Request <<case#>> Action Required
Hello,
I am with <<location>> IT End User Support and have received your PC Service Request, case # <<case#>>. I would like to help you with <<Short description of PC problem>>, but have not been able to get in touch with you by phone or email.
This is our third attempt to contact you. I will close this case unless I hear from you by end of business today. If you still need assistance with the same issue, please call the Help Desk at 000 000 (or 000-000-0000 after hours) and have your current ticket reopened within 20 days.
Thank you,
<<Analyst name>>

** Please replace each << >> placeholder with the specific information named inside the brackets.
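If you send these emails often, the << >> placeholders can be filled in programmatically. The following Python sketch shows one way to do it with a regular expression. The function name and the sample values are hypothetical, and the template text is shortened for the example.

```python
import re

# Shortened copy of Template 1 above.
TEMPLATE = (
    "Subject: PC Service Request <<case#>> Action Required\n"
    "Hello,\n"
    "I am with <<location>> IT End User Support and have received your "
    "PC Service Request, case # <<case#>>."
)


def fill_template(template, values):
    """Replace every <<name>> placeholder with values[name].

    Raises KeyError if a placeholder has no value, so that an email
    never goes out with a raw << >> marker in it.
    """
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for placeholder <<{key}>>")
        return values[key]

    return re.sub(r"<<(.+?)>>", substitute, template)


# Hypothetical sample values.
message = fill_template(TEMPLATE, {"case#": "12345", "location": "Head Office"})
print(message)
```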

Network Recovery Strategy ~ BCP

5:15 AM Servers, TIPS AND TRICKS

This post shows an example network recovery strategy, the network portion of a business continuity plan (BCP), that you can adapt to your business. The network and operations facilities at the data center provide business application systems for your business and include:

Core Network Services: (Exchange Email, File and Print Services, Internet / Intranet)
Business Applications: (ERP, Financials, A/R, A/P, AM, billing), Mainframe printing.
User Workstations and associated application software for approximately 150 users in Marketing, Finance, Staff / IT etc.

Disaster Classification:

Level 1 – Temporary (less than 7 days) – loss of power / water to your building. This would require the shutdown of the computer room and servers, but loss of equipment or data would be minimal.
Level 2 – Significant (greater than 7 days) – building cannot be occupied (fire/water damage, disease, other threat) but city infrastructure is intact.
Level 3 – Significant widespread damage to the city infrastructure (earthquake). Many core services are unavailable; employees are unable to report to work etc.

The IT recovery strategy is primarily designed to respond to level 1 or level 2 disasters. Level 3 disasters are within the scope of your business resumption plan where the primary focus is on ensuring the safety and security of employees and company assets and providing disaster assistance to the community.
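The classification above reduces to two questions: is the city infrastructure intact, and how long will the outage last? The Python sketch below encodes that reading. The function names are hypothetical, and because the text does not say how to treat an outage of exactly 7 days, the sketch assumes it counts as Level 2.

```python
def disaster_level(outage_days, city_infrastructure_intact=True):
    """Return the disaster level (1, 2, or 3) per the classification above."""
    if not city_infrastructure_intact:
        return 3
    # Assumption: exactly 7 days is treated as Level 2.
    return 1 if outage_days < 7 else 2


def covered_by_it_recovery_strategy(level):
    """The IT recovery strategy is designed for Level 1 and Level 2 only."""
    return level in (1, 2)


for days, intact in [(2, True), (14, True), (14, False)]:
    level = disaster_level(days, intact)
    print(days, intact, level, covered_by_it_recovery_strategy(level))
```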

Key Assumptions Example:

The primary recovery site for your site is the …
The backup site facilities have the minimum network and hardware components required to establish basic network operations. The alternate site emergency response facility and equipment (laptops/printers) are available for use.
A portion of the backup office facilities and equipment (workstations / printers) are available for your users.
The recovery of business application systems (ERP etc.) would require sourcing appropriate hardware (via hardware vendor)
All recovery documentation and required backup tapes are available offsite.
Current IT staff is available to perform recovery processes. Additional resources are available from your company's other locations.

Recovery Phases:

Phase 0 Day 1
     Disaster
     Ensure safety of employees
     Notification / communication / Formal Disaster Declaration
     Assembly of recovery team / roles
     Assessment of Impact, stability of recovery facility, and recovery timeframe
     Determine Recovery Strategy
     Order recovery tapes

Phase 1 Day 2-4
     Recover Core Network Components:
        • Exchange Server
        • File Servers
        • WAN connectivity

Phase 2 Day 5-10
     Recover Core Business Applications:
        • ERP
        • Mainframe Printing

Phase 3 Day 10-30
     Complete Recovery of All Systems or reactivation of data center

Notification and Declaration of a Disaster
The first and foremost objective when a disaster happens is to ensure the safety of all staff; this takes precedence over any recovery activities.
During a recovery process, recovery personnel must take appropriate and adequate rest breaks and use safety controls to ensure their personal safety. The maximum length of a recovery shift is 12 hours and includes periodic rest breaks.

The primary responsibility for declaring an IT/Network disaster and invoking the disaster recovery plan rests with the Manager of Information Technology. Secondary responsibility rests with the Network Team Lead and Office and Information Services Team Leads in consultation with your company Leadership Team. Specific responsibilities are:

Communicate disaster to your company Leadership team – what happened, why, when, initial assessment and recovery overview.
Identify and contact the IT recovery teams, recovery team leads, and a recovery coordinator. The recovery teams will be created from the existing IT organization based on who is available. For a disaster requiring recovery to an alternate site, or where the recovery time is likely to exceed 12 hours, at least 2 teams should be created.
Facilitate the assessment of the disaster and development of a recovery plan.
Ongoing communication to your company Leadership team, management and employees as appropriate.

GUIDELINES FOR SITES WITH < 512K AVAILABLE DATA-ONLY TO DC PROMO

6:12 AM Servers, TIPS AND TRICKS

I would like to share the guidelines for sites that have <512K available data-only connections and propose to DC promo their local DCs. For sites that plan to install and promote (locally) an AD domain controller, a 512K available data-only connection is the strong recommendation, and it is the only connection I am quite confident will succeed without requiring additional support and effort. If the site has a slower link than that, additional research is needed to ensure a smooth promotion and to minimize adverse business impact.

The recommendation we feel comfortable with is 512K data-only. We may be able to try anything under that, but unfortunately we cannot guarantee anything, so it is up to the site to determine whether it wants to take on that risk. For example, if a site's circuit is going to be upgraded later anyway, the site may want to wait, unless there is a business case that indicates it cannot wait.

Considerations:
Those at the site representing the business must understand and agree to the additional risk, should they elect to promote a DC over a <512K available data-only link, which includes:

Very slow connections to external resources, including applications, the Internet, etc., for a week or more. The site should expect 3 weeks.
A successful promotion at a similar site does not ensure that other similar sites will be successful, because every site is different and the databases grow daily.

Here are the rules:
If a site MUST DC promo over a link smaller than 512K available data-only ("available" meaning the bandwidth is not dedicated to some other data stream such as mail or ERP) or over a shared link, the Server Infrastructure Team needs to talk to the site to gather important information.

It is critical that the design teams understand what is going over the connection, so we can make an informed decision.
When promotions fail, it causes rework and could delay other sites. Also, the rework always introduces some small amount of risk that something inadvertently corrupts the rest of the forest.
Once we all understand what is going over the <512K link, the customer agrees to the risk, and the Server Infrastructure Team's design team feels it will not adversely impact others, we can OK the attempt at a DC promotion. Again, the site will be expected to significantly reduce any other traffic going over the link during the promotion, and also during the SMS build. Traffic that should be taken into consideration and significantly curtailed includes:
      • voice traffic
      • external application traffic (like intranet, ERP)
      • Internet traffic
In addition:
      • Promotions must start on the Friday just before a weekend to ensure the best throughput.
      • Whenever possible, promote the DC at a well-connected site and then ship it to the other site.
We can't make a lot of special allowances trying to make it work. If it promotes, it promotes; if it doesn't, it doesn't. In many cases we can try it, if the customer is willing to take on the risk. Even once all servers come up on site, we cannot be certain there will be no performance degradation when servers try to sync and so on. Again, there are too many variables.

We need to address and agree to the plan for <512K available data-only sites well in advance of their planned DC promotions. That way, things can run as smoothly as possible.

Application Service Provider Checklist Examples

4:35 AM Servers, TIPS AND TRICKS

The purpose of "Application Service Provider Checklist" is to obtain background information for those external vendors (3rd parties) that are currently providing or plan to provide external application hosting services for your business.

Items	Service Provider	Response
A1	Provide the name of the Application Service Provider (Outsourcer) and business address.
A2	Provide the name of the application to be hosted at the provider’s location.
A3	How long have you performed as or provided Application Service Provider (ASP) hosting services?
A4	How many applications do you provide hosting services for?
A5	How many customers do you currently support?
A5	How many customers do you support for the application your company is interested in (if you host more than one application)?
A6	Do you provide both shared and dedicated infrastructure (application, database, O/S) hosting options?
	a. How many customers utilize your shared infrastructure?
	b. How many customers utilize your dedicated infrastructure?
	c. Do you have separate database instances for your customers or do they share the same database?
	d. Are the application, web, and database on separate servers?
	e. How many application servers are used to host the application?
	f. How many database servers are used to host the database?
	g. For web-based environments, is the web server installed on the same server as the application? If no, how many web servers are used to support the application?
A7	What IT governance or security framework do you use for your control environment (COBIT, ISO 17799, ISO 27002, internal policies and standards, etc.)?
A8	Do you have an internal and/or external audit function?
A9	Have you contracted with a 3rd party to provide an attestation of your control environment (e.g. SAS70 certified, BITS)?
	a. Please indicate the name and how often performed.
	b. Note: for SAS70, please indicate Type I or Type II.
A10	Has or will a major acquisition (merger) occur in the next 6-12 months?
A11	What is your core business (expertise)?
Items	Network	Response
B1	Describe all end-to-end encryption methods currently supported (e.g. SSL, HTTPS, VPN, IPSEC, SFTP) to securely transport data between you and your customers. Include the strength of the cipher (e.g. 128 bit).
B2	Describe all email encryption methods you currently support (e.g. TLS, PGP, etc.).
B3	Are strong authentication measures (e.g. two-factor authentication using RSA tokens or smartcards) used for remote access to your network or for remote administration of network devices (e.g. firewalls, routers, switches, IDS, etc.)? Note: userid/password is single factor.
B4	Is redundancy and/or failover employed for critical devices such as firewalls, servers, load balancers, etc.? Please provide detail.
B5	Are intrusion detection or intrusion prevention systems used?
	a. Network based – where deployed
	b. Host based – where deployed
	c. Application based – where deployed
B6	Please provide information about vulnerability assessments performed for your environment:
	a. List the type of assessments performed (penetration tests, network vulnerability scanning, etc.).
	b. Describe the scope of the assessments (network perimeter, application assessment, etc.).
	c. How often are they performed?
	d. Are they performed by internal staff or external parties?
Items	Operations	Response
C1	Where is the primary processing facility (data center) located?
C2	Are any functions outsourced to a 3rd party (e.g. application development, system or network admin, data center)? Please describe.
C3	Is access to the datacenter where the IT infrastructure resides controlled by you or by a 3rd party?
C4	Describe your process for keeping abreast of security threats for network devices, database, and operating system components.
C5	Do you have procedures in place for incident response, escalation and investigation?
C6	Is a formal change control process used to manage and track customer change requests and changes to the application, database, network and operating system components?
C7	Are security threats (events) for the application, database and operating system logged and reviewed regularly? How often?
C8	Do you have separate development, test and production environments?
C9	Does the application reside in the same domain as the applications used to support your business? Does the application and its components reside on a separate VLAN from other applications?
C10	Is user access to the application controlled by the customer or the Application Service Provider (e.g. add/remove users, password management, assign roles, etc.)?
Items	Disaster Recovery	Response
D1	Do you have a documented Business Continuity and Disaster Recovery Plan to address short term and long term disruptions of service?
D2	Are the plans reviewed and tested at least annually?
D3	Describe customer involvement in the annual testing.
D4	Where is your alternate processing facility located?
D5	Is the alternate processing facility a hot-site or cold-site? If other please explain.
D6	What types of natural disasters are common in the region where the primary data center is located?

New Infrastructure Systems

12:53 AM Servers, TIPS AND TRICKS

This article describes the processes for new infrastructure systems. The information applies to sites ranging from newly acquired sites to established sites. Sites may be small (0-300 PCs), medium (301-500 PCs), or large (501-1000 PCs and up).

Refer to this article if:

Your site is not currently on the planned namespace, and you plan to install your company's workstations or servers. (That entire process is described here.)

You need to perform a subset of a new infrastructure deployment, such as:
    • Installing a new subnet for a site
    • Setting up DHCP at a site
    • Setting up DNS at a site
    • Installing Organizational Unit (OU) structure at a site
    • Setting up AD (Sites and Services) for a site
    • Installing an AD Domain Controller at a site
    • Installing a software management services server at a site

Site Services Definitions: Subnets, DNS, DHCP, OUs
Subnet Design
The network team, which makes router updates for all sites, should plan and design the following as required:

IP readdressing

Subnet mask

Note: Plan this design a minimum of 15 weeks before the infrastructure needs to be in place (your scheduled IP readdress date, or server/desktop deployment date).

New sites
For those deploying newly acquired sites, the following steps will be required:

Design new IP Addresses

Design DNS Entries

Subnet Design

Work on site design

Procure hardware and installation – Routers, rack, servers, space, power requirements

Procure Hardware
Determine Hardware Requirement
Determine your site’s hardware requirements and place your order 15 weeks prior to Day 1 of deployments. Depending on your location and procurement processes, it can take anywhere between 2 weeks and 4 months to receive your hardware.
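The 15-week lead time is easy to turn into a calendar date. The short Python sketch below is a worked example only: the function name and the sample Day 1 date are hypothetical.

```python
from datetime import date, timedelta


def hardware_order_deadline(day_one, lead_weeks=15):
    """Return the latest order date, lead_weeks before Day 1 of deployment."""
    return day_one - timedelta(weeks=lead_weeks)


# Hypothetical Day 1 of deployment.
day_one = date(2012, 9, 3)
print(hardware_order_deadline(day_one))  # 2012-05-21
```

Because delivery can take up to 4 months at some locations, treat this date as the latest acceptable order date rather than a target.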

Server Builds
The individual server teams can take control to finish the process, which takes 5-7 work days for each server. For example:

AD Server and DC Promo – DCs need 5-7 work days to be built (this assumes the servers are already racked, have been turned on to burn in the hardware for a minimum of 48 hours, and are ready to begin DC Promo by day 1). After building out the AD Server and promoting the DC, set up the trust between CT and the resource domain.

Software Management Services Server – SMS needs 5-7 work days (this assumes the server is already racked, built to Brand, and has been turned on to burn in the hardware for a minimum of 48 hours by day 1). To avoid workstation deployment delays, do not plan on deploying workstations until a minimum of 10 work days after the SMS server is complete. The SMS and Exchange servers must wait for the DC server to be completed. Exchange is also dependent on SMS being available.

DHCP/DNS Build (Create/Delegate DHCP Scope) – NS Servers need 5-7 work days for completion.

Exchange Server – Active Directory and SAN storage will need to be in place prior to server installation. Mailboxes can be migrated as soon as servers and storage are in place. Note that if Exchange will be at the site, then a DC must be at the site as well.

Print Server - Build up to brand 8 weeks prior to PC Deployment
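The build dependencies described above (SMS and Exchange wait for the DC, Exchange also depends on SMS, and workstations wait for SMS) can be checked with a topological sort. The Python sketch below uses the standard library's `graphlib` module (Python 3.9 or later). The node names are shorthand invented for this example, and the sketch covers ordering only, not the work-day durations.

```python
from graphlib import TopologicalSorter

# Each key depends on the items in its set.
dependencies = {
    "DC": set(),
    "SMS": {"DC"},
    "Exchange": {"DC", "SMS"},
    "Workstations": {"SMS"},
}

# static_order() yields a build order in which every item appears
# after all of its dependencies; it raises CycleError on a cycle.
build_order = list(TopologicalSorter(dependencies).static_order())
print(build_order)
```

Exchange and Workstations have no dependency on each other, so either may come first in the resulting order.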

Data Center Fire Sprinkler System

12:09 AM Servers, TIPS AND TRICKS

This standard operating procedure template provides guidelines for operating, inspecting, and maintaining the data center fire sprinkler system at your site(s). This will be achieved by a workforce including, but not limited to, contractors, vendor partners, and employees who consistently apply safe work practices. Safe work practices, including emergency procedures in the event of disasters, support Operational Excellence. A safe and secure environment helps ensure the health and well-being of all individuals (including workforce and visitors) and minimizes the impact of incidents that could affect business operations.

The goal is to mitigate potential and preventable impacts to data center operations that could be caused by a fire going undetected, through the application of industry standards and current best practices, ensuring 100% operability and reliability when the system is called upon for service.

Workforce members should assist each other in following the guidelines.

Input(s)

The Facilities Team is responsible for maintaining an operational fire sprinkler system.

See contact list for appropriate contacts.

Any use of chemicals, cleaners, lubricants, etc. in support of data center operations requires that a material safety data sheet be submitted to and approved by the data center operations manager before the material is brought onto the site.

The workforce is responsible for proper housekeeping practices, including the storage of tools and equipment during and after the work, cleanup, and waste disposal.

All tools and equipment must be stored in a pre-approved area, or removed from the work site.

All regulatory and company safe work practices must be followed where applicable, including but not limited to personal protective equipment, safety barriers, lockout/tagout, and confined space entry.

Any member of the workforce may stop any work in progress due to unsafe work methods, conditions causing the area to be unsafe, emergency, or for any other data center operational necessity. All unsafe work methods will be reported to the data center operations manager who will investigate the situation and take corrective action.

Depending on organizational capability to perform the required work and/or maintenance, it may be necessary to contract with a qualified third party for some or all of the work described below. Note that many authorities having jurisdiction require that a licensed fire protection contractor perform at least some of the work on these systems. For instance, sprinkler head inspections are typically done by facility staff, while the replacement of a pre-action valve is typically done by a licensed contractor. These requirements shall be verified for each site prior to implementing this standard operating procedure.

System Description
—Include Site System Information Here

The fire sprinkler system consists of a fire alarm panel, pump panel, pumps, piping, valves, and sprinkler heads.

Location—Indicate where the system is installed to include the risers, pumps, fire alarm panel, and other system components. Indicate if the system is installed under the raised floor, above the dropped ceiling or only at the rack level as applicable.

Specifically describe the operation of the pump panel. Define how to run the pumps at no flow, minimum flow, rated flow, and peak flow conditions.

Indicate if the use of passwords is required.

Indicate the location of the Operations and Maintenance manuals for further reference.

Reference the Fire Alarm Panel and Detection standard operating procedure for the following:
• Detection sequence leading to alarm and discharge
• How an alarm or fault is indicated both at the panel and at the horns and strobes for normal, alarm and fault conditions.
• Fire panel buttons, menus and diagnostic testing.
• Silencing alarms, clearing/resetting the system.
• Alarm notification sequence and expected response times.

Outputs
Data center manager shall ensure that a logbook is kept up to date with all inspection results and required actions.

Metrics

Fire sprinkler system remains 100% available.

No false alarms.

Blog Archive

Saturday, April 21, 2012

Recognizing Symptoms

Troubleshooting Strategy

Friday, April 20, 2012

Best Practices for Change Management Process

Thursday, April 19, 2012

How To Prioritization for Incidents

Wednesday, April 18, 2012

Best Practice to contact an end user

Monday, April 16, 2012

Network Recovery Strategy ~ BCP

Sunday, April 15, 2012

GUIDELINES FOR SITES WITH < 512K AVAILABLE DATA-ONLY TO DC PROMO

Application Service Provider Checklist Examples

Saturday, April 14, 2012

New Infrastructure Systems

Data Center Fire Sprinkler System