
Satellite Internet Forum.

Welcome to this satellite broadband discussion forum. Wherever you are and whatever your problem, we are here to help each other. Connecting to the internet via satellite is not always easy, but it is critically important to those in remote places or with poor terrestrial infrastructure. Both service providers and customers are encouraged to contribute. Register at the bottom of the forum home page if you wish to contribute or ask a question. Read the Forum rules.

NMS Server Hangs and Sudden Outage

Ex Member


Jun 30th, 2008 at 9:44am  
Dear all,

For the last three days, at around 11 AM, our whole network has gone down for 20 minutes or more. We restarted the NRD server but the network didn't come back; later we restarted the NMS server and it came back, but it took 20-30 minutes for all the terminals to re-acquire the network. I would like to know the possible reasons for this problem and how we can solve it soon.

Yesterday, when the same problem occurred, the NMS was stuck, and when I tried to input commands it responded very slowly.

Please, can anyone help me solve this issue?


smlt
Ex Member


Reply #1 - Jun 30th, 2008 at 12:03pm  
Questions:

1. What version are you using?

2. Is it happening at the same time every day?

3. What does the memory on your NMS look like at the time of the outage? (If you are not watching utilization on that box, you need to be. The TOP command is your friend). You may be running out of memory.

4. Do all of the remotes fall out of the network at the time of the outage? I need to know if the NODE observes the outage (I am trying to ensure that it is not just an OBSERVED outage on the NMS...and that the users have truly reported a break in IP connectivity).
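The memory check in question 3 can be taken with standard Linux tools (a generic sketch, not iDirect-specific; run it on the NMS during, or just before, the usual outage window):

```shell
# Snapshot memory and the busiest processes on the NMS.
free -m                   # RAM and swap usage in MB
top -b -n 1 | head -n 15  # batch-mode top: load average plus the top processes
```

If swap usage climbs sharply around the outage time, the box is likely running out of memory as described above.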

If the outage is happening at the same time every day, it is more than likely a cronjob that is crashing your boxes. What can happen is that the NMSs and PPs get overwhelmed trying to execute a cron (from the crontab)...to the point that they disregard their network duties (dynamic BW allocation, etc.). It is either a cron or your boxes are simply running out of memory.

The slocate cron is notorious for doing just what I described above (overwhelming the RHEL3 boxes). It is executed every day at a certain time (the actual time depends on the time set on your boxes, i.e. whatever your hwclock is set/synced to). iDirect actually recommends that slocate.cron be turned off, and there is a tech bulletin regarding the issue.

The lengthy re-acquisition of your remotes could be an entirely separate issue and is possibly the result of remote bursts arriving too high (or low) at the configured network UCP (thus skewing the UCP). I have seen networks that were improperly commissioned (read: remotes with improper initial tx values) take 20+ minutes to re-acquire the network due to UCP skew. It can all be negated or minimized by the Hub Network Operator properly commissioning each remote after it initially enters the network.

« Last Edit: Jun 30th, 2008 at 10:01pm by N/A »  
 
Ex Member


Reply #2 - Jul 1st, 2008 at 1:31pm  


Thanks for the reply

The version is 7.0.2.

It was, but today it happened at 7 AM. Yesterday we checked the NMS and PP and cleaned the database.

Yes, all remotes fall out of the network.

Ex Member


Reply #3 - Jul 1st, 2008 at 1:36pm  
Not telling you how to handle your business, but be very careful with that SQL Database.

So, the issue happened at a different time?
Ex Member


Reply #4 - Jul 3rd, 2008 at 11:50am  
Hi,

Again, a break in IP connectivity happened today in two of our inroute groups, but it didn't affect one inroute.

What could be the problem? We checked our event server and NMS servers, but the processes during that time were normal.

What could be the reason?
Ex Member


Reply #5 - Jul 3rd, 2008 at 12:18pm  
Did it happen at the same time as any of the previous days?

Checking the NMS is a good idea (and recommended), but I also recommend you keep an eye on the PPs. If you are observing an outage, they are more than likely the culprit. I know I mentioned just the NMS above, but I recommend watching the utilization on the PPs as well. If the service interruption is happening at the same time (or approximately the same time) every day, I am almost certain it is a cronjob crashing your network.
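One way to watch utilization over time (a minimal sketch; the log path is an arbitrary example) is to append a timestamped snapshot every minute on the NMS and each PP:

```shell
# Append a timestamped load/memory snapshot every 60 seconds.
# Stop with Ctrl-C; review /tmp/util.log after the next outage.
while true; do
    { date; uptime; free -m; echo; } >> /tmp/util.log
    sleep 60
done
```

Matching the timestamps in the log against the outage time will show whether a cron firing at a fixed hour, or a memory spike, lines up with the service interruption.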

I recommend (as a precaution) that you get hold of the tech bulletin from iDirect regarding slocate.cron and follow the instructions in that document to ensure that slocate.cron is suspended...once suspended, keep the network under observation to see if the outage persists (watch closely to see if it shows up again).

Do you have an iSupport agreement with iDirect that gives you access to the iDirect TAC webpages?
« Last Edit: Jul 3rd, 2008 at 11:19pm by N/A »  
 
Ex Member


Reply #6 - Jul 5th, 2008 at 10:28am  

Let me know the conditions under which we should suspend the cron job, so that we can say with reasonable confidence that slocate.cron is what is causing this outage.
Ex Member


Reply #7 - Jul 5th, 2008 at 2:39pm  
I recommend you attempt this ONLY if you are skilled with the Linux vi editor. If you do not have a Linux background, or do not understand the fundamentals of Linux, I suggest you find someone who DOES understand the steps below before proceeding. Attempting this without that understanding could cause a significant outage and is NOT recommended. Proceed at your own risk. Some background:

slocate.cron is a default (scheduled) maintenance job that runs on each of your RHEL3 boxes at 4:02am every day. The job causes the protocol processor to become too busy to process information in a timely manner...thus causing an outage. My only concern so far is that you said it DID NOT affect one inroute. However, it doesn't hurt to suspend slocate.cron and see if that resolves your problem. slocate's function is to generate a database that accelerates file searches made with the slocate or locate commands.

Stopping this daily cron job means you lose the ability to search for files with those commands...which, under most circumstances, we (as HNOs) do not use anyway.

The file is located under /etc/cron.daily. The filename is slocate.cron and it contains the following:
#!/bin/sh
renice +19 -p $$ >/dev/null 2>&1
/usr/bin/updatedb -f "nfs,smbfs,ncpfs,proc,devpts" -e "/tmp,/var/tmp,/usr/tmp,/afs,/net"


Edit the file from what is shown above to:

#!/bin/sh
renice +19 -p $$ >/dev/null 2>&1
/usr/bin/updatedb -f "nfs,smbfs,ncpfs,proc,devpts" -e "/"


You must edit all of your RHEL3 boxes: the Protocol Processor blades and the NMS server(s). You do not have to restart the cron daemon. Also, do NOT keep a copy of the original (or altered) file under the /etc/cron.daily directory.

All files under that directory are executed daily at the designated time (4:02AM), including any saved copy.
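For those comfortable on the command line, the same edit can be scripted instead of done in vi (a sketch, assuming GNU sed; test it on one box before rolling out to all of them):

```shell
# Keep the backup OUTSIDE /etc/cron.daily, as noted above.
cp /etc/cron.daily/slocate.cron /root/slocate.cron.orig
# Replace the exclude list with "/" so updatedb indexes nothing.
sed -i 's|-e ".*"|-e "/"|' /etc/cron.daily/slocate.cron
# Verify the change took effect.
grep updatedb /etc/cron.daily/slocate.cron
```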
« Last Edit: Jul 11th, 2008 at 12:27am by N/A »  
 
Ex Member


Reply #8 - Jul 5th, 2008 at 3:08pm  


Thanks for the reply, I really appreciate it...

I'll try this, but the job runs at 4:02 and the problems are reported after 10 AM. In that case, does it mean that the job took maximum CPU, ran very slowly, and when the NMS and PP received other tasks they couldn't handle everything together, which hung the hardware?
Ex Member


Reply #9 - Jul 5th, 2008 at 5:28pm  
Understood. You need to see what your hardware clock (hwclock) is synced to. Your NMS time (in the lower right corner) MAY vary from your hwclock.

For all we know, suspending slocate may not resolve your problem...but it sure won't hurt to stop that job...which is why I recommended it (to see if it is the culprit).

However...a word of caution (again): if you are not skilled enough to navigate through Linux file directories (via the command line) and perform some basic vi editing (with save), you may not want to attempt this. I say this as a warning, as I don't want an unskilled person to cause an even larger problem from the command line. If you decide to do this (and edit the cron), make sure you use :wq! to save your work when you are done (no need to reboot the RHEL3 boxes after your editing work is complete). At a minimum, ensure slocate is suspended on both PPs.
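Comparing the two clocks side by side is straightforward (a generic sketch; hwclock usually requires root):

```shell
date            # system time; cron fires according to this clock
hwclock --show  # hardware clock; compare the two for any large offset
```

A large offset between them would explain outages that do not line up with the expected 4:02am cron time.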

Have fun with it.

M
« Last Edit: Jul 6th, 2008 at 11:08pm by N/A »  
 
Ex Member


Reply #10 - Jul 9th, 2008 at 9:25am  
Hi

Again, one inroute was down for a few minutes; it's not a complete outage. I checked the event server: it looked a bit busy, but the process status was 86% idle at that time, so the event server was not busy at all. What made the processing slow, then?
Ex Member


Reply #11 - Jul 9th, 2008 at 11:54am  
1. How do the remotes look at the time the outage is observed? Are there any low SNRs (downstream or upstream) to indicate there might be a problem with any of the carriers? Any sloping? Fading? When you query the LINE CARD stats in iMonitor, do you see any errors at the time service is interrupted?

2. Tell me about your PP blades. Are the remotes properly balanced across the blades? i.e. Is there an equal number of nodes on each blade? (Right-click the blades in iMonitor, select BLADE INFO, and expand the remotes tab.) Please ensure there is an equal number of remotes on each PP blade.

3. How long has it been since the PP blades were balanced (by telnetting to the pp_controller and issuing the blades rebalance command on the console)? Are you familiar with the procedure? (*It will cause a service interruption, so plan accordingly.*)

Considering that it is happening around a certain time of day (by the way, was that the case in the latest outage?), I am sticking to my original assessment: a cronjob may be tying up your RHEL boxes. That, or your blades are in serious need of a rebalance.
« Last Edit: Jul 11th, 2008 at 12:24am by N/A »  
 
Ex Member


Reply #12 - Jul 10th, 2008 at 2:35am  
1. Note: when you use the "blades rebalance" command at the NMS-SVR, all the remotes will drop out of the network and come back again, so choose a maintenance window for this activity!
2. Are you using a single NMS or a DNMS? From your input, it looks to me like you are using a DNMS?
3. How many remotes are activated/configured at your hub?
4. Find out whether the problem is on the NMS-SVR, the PP-SVR, or the RF side.
5. Check the CPU usage on your NMS-SVR and PP-SVRs. If CPU usage is >90%, check which process consumes the most CPU. If it really is due to high traffic, you may need to add another server.
6. If you find any unwanted files or old log files, you may delete them.
7. Make sure the backup is running and the /root/backup directory holds few files.
8. Check whether there is any interference carrier at your operating frequency. Check with the satellite operator for any sun outage.
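Item 5 above can be checked with standard tools (a generic Linux sketch; the 90% figure is the poster's threshold, not an iDirect value):

```shell
# Overall CPU state and the busiest processes.
top -b -n 1 | head -n 20
# Top five processes sorted by CPU share.
ps -eo pcpu,pid,comm --sort=-pcpu | head -n 6
```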


Hope this helps!