5/11/12

Troubleshooting Common Problems (N1 Grid Engine 6 User's Guide)

Troubleshooting Common Problems

This section provides information to help you diagnose and respond to the cause of common problems.

Problem: The output file for your job says, Warning: no access to tty; thus no job control in this shell....

Possible cause: One or more of your login files contain an stty command. These commands are useful only if a terminal is present.

Possible solution: No terminal is associated with batch jobs. You must remove all stty commands from your login files, or you must bracket such commands with an if statement. The if statement should check for a terminal before processing. The following example shows an if statement:

    /bin/csh:
    # stty -g checks the terminal status; it succeeds only if a terminal is present
    stty -g
    if ($status == 0) then
        <put all stty commands in here>
    endif
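
For login files that are read by Bourne-type shells (for example, .profile), the same guard can be sketched as follows. This is an illustration and not part of the original guide: the helper name run_if_tty is hypothetical, and the test [ -t 0 ] assumes that standard input is the descriptor that would be attached to the terminal.

```shell
# Run a command only when standard input is attached to a terminal.
# Batch jobs have no terminal, so the command is silently skipped there.
run_if_tty() {
    if [ -t 0 ]; then
        "$@"
    fi
}
```

In a login file, each stty call would then be written as, for example, run_if_tty stty erase '^H'.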

Problem: The job standard error log file says `tty`: Ambiguous. However, no reference to tty exists in the user's shell that is called in the job script.

Possible cause: shell_start_mode is, by default, posix_compliant. Therefore all job scripts run with the shell that is specified in the queue definition. The scripts do not run with the shell that is specified on the first line of the job script.

Possible solution: Use the -S flag to the qsub command, or change shell_start_mode to unix_behavior.
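
To pass the -S flag without retyping the interpreter, the path can be read from the first line of the job script. The sketch below assumes that the script starts with a #! line. The helper name script_interpreter is hypothetical, and job.sh in the usage note is a placeholder.

```shell
# Print the interpreter path named on the #! line of a script.
# Prints nothing if the first line is not a #! line.
script_interpreter() {
    sed -n '1s/^#![ ]*\([^ ]*\).*/\1/p' "$1"
}
```

A submission could then look like: qsub -S "$(script_interpreter job.sh)" job.sh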

Problem: You can run your job script from the command line, but the job script fails when you run it using the qsub command.

Possible cause: Process limits might be set for your job. To test whether limits are being set, write a test script that performs limit and limit -h functions. Run both functions interactively, at the shell prompt and using the qsub command, to compare the results.

Possible solution: Remove any commands in configuration files that set limits in your shell.
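
The limit and limit -h commands are csh built-ins. For a Bourne-type shell, a comparable test script might look like the sketch below. It assumes a shell whose ulimit built-in accepts the -S, -H, and -a options (bash, ksh, and dash do). The function name report_limits is hypothetical.

```shell
# Print the soft and hard resource limits of the current shell.
report_limits() {
    echo "== soft limits =="
    ulimit -S -a
    echo "== hard limits =="
    ulimit -H -a
}

report_limits
```

Run the script once at the shell prompt and once through qsub, then compare the two outputs. Any limit that differs is a candidate for the failure.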

Problem: Execution hosts report a load of 99.99.

Possible cause: The execd daemon is not running on the host.

Possible solution: As root, start up the execd daemon on the execution host by running the $SGE_ROOT/default/common/rcsge script.

Possible cause: A default domain is incorrectly specified.

Possible solution: As the grid engine system administrator, run the qconf -mconf command and change the default_domain variable to none.

Possible cause: The qmaster host sees the name of the execution host as different from the name that the execution host sees for itself.

Possible solution: If you are using DNS to resolve the host names of your compute cluster, configure /etc/hosts and NIS to return the fully qualified domain name (FQDN) as the primary host name. Of course, you can still define and use the short alias name, for example, 168.0.0.1 myhost.dom.com myhost.

If you are not using DNS, make sure that all of your /etc/hosts files and your NIS table are consistent, for example, 168.0.0.1 myhost.corp myhost or 168.0.0.1 myhost.

Problem: Every 30 seconds a warning that is similar to the following message is printed to cell/spool/host/messages:

    Tue Jan 23 21:20:46 2001|execd|meta|W|local configuration meta not defined - using global configuration

But cell/common/local_conf contains a file for each host, with FQDN.

Possible cause: The host name resolving at your machine meta returns the short name, but at your master machine, meta with FQDN is returned.

Possible solution: Make sure that all of your /etc/hosts files and your NIS table are consistent in this respect. In this example, a line such as the following text could erroneously be included in the /etc/hosts file of the host meta:

    168.0.0.1 meta meta.your.domain

The line should instead be:

    168.0.0.1 meta.your.domain meta
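
A quick way to spot such lines is to list every hosts-file entry whose first name has no domain part. The following is a minimal sketch, assuming that "fully qualified" can be approximated by "contains a dot". The function name check_primary_fqdn is hypothetical, and entries whose primary name is localhost are skipped on purpose.

```shell
# List hosts-file entries whose primary (first) name is not fully qualified.
# Exit status is 1 if any such entry is found, 0 otherwise.
check_primary_fqdn() {
    awk '
        BEGIN { bad = 0 }
        /^[ \t]*#/ { next }               # skip comment lines
        NF < 2 { next }                   # skip blank or incomplete lines
        $2 == "localhost" { next }        # loopback entries are expected
        $2 !~ /\./ { print; bad = 1 }     # primary name has no domain part
        END { exit bad }
    ' "$1"
}
```

Running check_primary_fqdn /etc/hosts on each host in the cluster prints the entries to review. It does not check NIS tables.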

Problem: Occasionally you see CHECKSUM ERROR, WRITE ERROR, or READ ERROR messages in the messages files of the daemons.

Possible cause: As long as these messages do not appear in a one-second interval, you need not do anything. These messages typically can appear between 1 and 30 times a day.
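
To judge whether the rate is within that range, the messages can be counted per day. The sketch below assumes that each line of the messages file starts with a timestamp in the layout used by the log excerpts in this section (weekday, month, day, time, year) followed by a | character. The function name errors_per_day is hypothetical.

```shell
# Count CHECKSUM ERROR, WRITE ERROR, and READ ERROR lines per calendar day.
# Assumes the first |-separated field is a timestamp such as
# "Tue Jan 23 21:20:46 2001" (weekday, month, day, time, year).
errors_per_day() {
    awk -F'|' '
        /CHECKSUM ERROR|WRITE ERROR|READ ERROR/ {
            split($1, t, " ")
            count[t[2] " " t[3] " " t[5]]++
        }
        END { for (day in count) print count[day], day }
    ' "$1"
}
```

Each output line holds a count followed by the month, day, and year. Counts far above 30 for a single day suggest that the messages deserve a closer look.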

Problem: Jobs finish on a particular queue and return the following message in qmaster/messages:

    Wed Mar 28 10:57:15 2001|qmaster|masterhost|I|job 490.1 finished on host exechost

Then you see the following error messages in the execution host's exechost/messages file:

    Wed Mar 28 10:57:15 2001|execd|exechost|E|can't find directory "active_jobs/490.1" for reaping job 490.1

    Wed Mar 28 10:57:15 2001|execd|exechost|E|can't remove directory "active_jobs/490.1": opendir(active_jobs/490.1) failed: Input/output error

Possible cause: The $SGE_ROOT directory, which is automounted, is being unmounted, causing the sge_execd daemon to lose its current working directory.

Possible solution: Use a local spool directory for your execd host. Set the parameter execd_spool_dir, using qmon or the qconf command.

Problem: When submitting interactive jobs with the qrsh utility, you get the following error message:

    % qrsh -l mem_free=1G
    error: error: no suitable queues

However, queues are available for submitting batch jobs with the qsub command. These queues can be queried using qhost -l mem_free=1G and qstat -f -l mem_free=1G.

Possible cause: The message error: no suitable queues results from the -w e submit option, which is active by default for interactive jobs such as qrsh. Look for -w e on the qrsh(1) man page. This option causes the submit command to fail if the qmaster does not know for sure that the job is dispatchable according to the current cluster configuration. The intention of this mechanism is to decline job requests in advance, in case the requests can't be granted.

Possible solution: In this case, mem_free is configured to be a consumable resource, but you have not specified the amount of memory that is to be available at each host. The memory load values are deliberately not considered for this check because memory load values vary. Thus they can't be seen as part of the cluster configuration. You can do one of the following:

- Omit this check generally by explicitly overriding the qrsh default option -w e with the -w n option. You can also put this command into sge-root/cell/common/cod_request.
- If you intend to manage mem_free as a consumable resource, specify the mem_free capacity for your hosts in complex_values of host_conf by using qconf -me hostname.
- If you don't intend to manage mem_free as a consumable resource, make it a nonconsumable resource again in the consumable column of complex(5) by using qconf -mc hostname.

Problem: qrsh won't dispatch to the same node it is on. From a qsh shell you get a message such as the following:

    host2 [49]% qrsh -inherit host2 hostname
    error: executing task of job 1 failed:

    host2 [50]% qrsh -inherit host4 hostname
    host4

Possible cause: gid_range is not sufficient. gid_range should be defined as a range, not as a single number. The grid engine system assigns each job on a host a distinct gid.

Possible solution: Adjust the gid_range with the qconf -mconf command or with QMON. The suggested range is as follows:

    gid_range 20000-20100
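
Because each job on a host receives its own gid, the number of IDs in the range presumably limits how many jobs can hold one at the same time on that host. The sketch below computes that number for a single low-high range. The function name gid_range_size is hypothetical, and values that list several ranges are not handled.

```shell
# Print how many group IDs a gid_range value such as "20000-20100" provides.
# A single number counts as one ID. Only a single "low-high" range is handled.
gid_range_size() {
    case "$1" in
        *-*) echo $(( ${1#*-} - ${1%-*} + 1 )) ;;
        *)   echo 1 ;;
    esac
}
```

For the suggested value, gid_range_size 20000-20100 prints 101, whereas a single number such as 20000 yields 1.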

Problem: qrsh -inherit -V does not work when used inside a parallel job. You get the following message:

    cannot get connection to "qlogin_starter"

Possible cause: This problem occurs with nested qrsh calls. The problem is caused by the -V option. The first qrsh -inherit call sets the environment variable TASK_ID. TASK_ID is the ID of the tightly integrated task within the parallel job. The second qrsh -inherit call uses this environment variable for registering its task. The command fails as it tries to start a task with the same ID as the already running first task.

Possible solution: You can either unset TASK_ID before calling qrsh -inherit, or choose to use the -v option instead of -V. This option exports only the environment variables that you really need.
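
The first option, unsetting TASK_ID, can be wrapped in a small helper so that the calling shell keeps its own value. This is a minimal sketch, and the function name without_task_id is hypothetical.

```shell
# Run a command with TASK_ID removed from its environment.
# The subshell keeps the change from affecting the calling shell.
without_task_id() {
    (
        unset TASK_ID
        "$@"
    )
}
```

A nested call would then be written as, for example, without_task_id qrsh -inherit -V host4 hostname.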

Problem: qrsh does not seem to work at all. Messages like the following are generated:

    host2 $ qrsh -verbose hostname
    local configuration host2 not defined - using global configuration
    waiting for interactive job to be scheduled ...
    Your interactive job 88 has been successfully scheduled.
    Establishing /share/gridware/utilbin/solaris64/rsh session to host exehost ...
    rcmd: socket: Permission denied
    /share/gridware/utilbin/solaris64/rsh exited with exit code 1
    reading exit code from shepherd ...
    error: error waiting on socket for client to connect: Interrupted system call
    error: error reading return code of remote command
    cleaning up after abnormal exit of /share/gridware/utilbin/solaris64/rsh
    host2 $

Possible cause: Permissions for qrsh are not set properly.

Possible solution: Check the permissions of the following files, which are located in $SGE_ROOT/utilbin/. (Note that rlogin and rsh must be setuid and owned by root.)

    -r-s--x--x 1 root root 28856 Sep 18 06:00 rlogin*
    -r-s--x--x 1 root root 19808 Sep 18 06:00 rsh*
    -rwxr-xr-x 1 sgeadmin adm 128160 Sep 18 06:00 rshd*

Note: The sge-root directory also needs to be NFS-mounted with the setuid option. If sge-root is mounted with nosuid from your submit client, qrsh and associated commands will not work.
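
The two requirements for rlogin and rsh (setuid bit set, owner root) can be checked with a short script. The following is a sketch only. The function name check_setuid_root is hypothetical, and the loop in the usage note assumes that the binaries live in one architecture subdirectory below $SGE_ROOT/utilbin, as in the solaris64 path shown in the messages above.

```shell
# Report whether a file has the setuid bit set and is owned by root (UID 0).
check_setuid_root() {
    if [ ! -u "$1" ]; then
        echo "$1: missing or setuid bit not set"
        return 1
    fi
    owner=$(ls -ln "$1" | awk '{ print $3 }')
    if [ "$owner" != "0" ]; then
        echo "$1: not owned by root"
        return 1
    fi
    echo "$1: ok"
}
```

A possible invocation is: for f in "$SGE_ROOT"/utilbin/*/rlogin "$SGE_ROOT"/utilbin/*/rsh; do check_setuid_root "$f"; done. The script checks file permissions only. It cannot tell whether the file system is mounted with nosuid.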

Problem: When you try to start a distributed make, qmake exits with the following error message:

    qrsh_starter: executing child process qmake failed: No such file or directory

Possible cause: The grid engine system starts an instance of qmake on the execution host. If the grid engine system environment, especially the PATH variable, is not set up in the user's shell resource file (.profile or .cshrc), this qmake call fails.

Possible solution: Use the -v option to export the PATH environment variable to the qmake job. A typical qmake call is as follows:

    qmake -v PATH -cwd -pe make 2-10 --
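
Before changing the submit command, it can help to confirm that the PATH is really the cause. The sketch below tests whether a command can be found through the current PATH. The function name on_path is hypothetical, and the check is only meaningful when it runs in the same kind of shell that the job uses on the execution host.

```shell
# Succeed if the named command can be found through the current PATH.
on_path() {
    command -v "$1" >/dev/null 2>&1
}
```

For example, on_path qmake || echo "qmake is not in PATH" prints a message only when the lookup fails.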

Problem: When using the qmake utility, you get the following error message:

    waiting for interactive job to be scheduled ...timeout (4 s) expired while waiting on socket fd 5
    Your "qrsh" request could not be scheduled, try again later.

Possible cause: The ARCH environment variable could be set incorrectly in the shell from which qmake was called.

Possible solution: Set the ARCH variable correctly to a supported value that matches an available host in your cluster, or else specify the correct value at submit time, for example, qmake -v ARCH=solaris64 ...

Typical Accounting and Reporting Console Errors

Problem:
The installation of the Sun Web Console Version 2.0.3 fails with the following error message:

    # ./inst_reporting
    ...
    Register the N1 SGE reporting module in the webconsole
    Registering com.sun.grid.arco_6u3.
    Starting Sun(TM) Web Console Version 2.0.3...
    Ambiguous output redirect.

Solution:
This Sun Web Console version can only be installed by the user noaccess, who has /bin/sh as their login shell. The user must be added with the following command:

    # useradd -u 60002 -g 60002 -d /tmp -s /bin/sh -c "No Access User" noaccess
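
To verify the result, the login shell recorded for the user can be read back from the passwd data. This is a minimal sketch, assuming that the account is stored in a local passwd-format file and not in a name service such as NIS. The function name login_shell_in is hypothetical.

```shell
# Print the login shell recorded for a user in a passwd-format file.
# $1 is the user name, $2 is the file to search.
login_shell_in() {
    awk -F: -v user="$1" '$1 == user { print $7 }' "$2"
}
```

After the useradd command shown above, login_shell_in noaccess /etc/passwd should print /bin/sh.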

Problem:
The table/view drop-down menu of a simple query definition does not contain any entry, but the tables are defined in the database.
Solution:
The problem normally occurs if Oracle is used as the database. During the installation of the reporting module the wrong database schema name has been specified. For Oracle, the database schema name is equal to the name of the database user which is used by dbwriter (the default name is arco_write). For Postgres, the database schema name should be public.

Problem:
Connection refused.
Solution:
The smcwebserver might be down. Start or restart the smcwebserver.

Problem:
The list of queries or the list of results is empty.
Solution:
The cause can be any of the following:
- The database is down. Start or restart the database.
- No more database connections are available. Increase the number of allowable connections to the database.
- An error exists in the configuration file of the application. Check the configuration for wrong database users, wrong user passwords, or wrong type of database, and then restart the application.
- No queries are available. If the query directory /var/spool/arco/queries is not empty, the following errors might have occurred:
  - Queries in the XML files are syntactically incorrect. Check the log file for error messages from the XML parser.
  - User noaccess has no read or write permissions on the query directory.

Problem:
The list of available database tables is empty.
Solution:
The cause can be any of the following:
- The database is down. Start or restart the database.
- No more database connections are available. Increase the number of allowable connections to the database.
- An error exists in the configuration file of the application. Check the configuration for wrong database users, wrong user passwords, or wrong type of database, and then restart the application.

Problem:
The list of selectable fields is empty.
Solution:
No table is selected. Select a table from the list.

Problem:
The list of filters is empty.
Solution:
No fields are selected. Define at least one field.

Problem:
The sort list is empty.
Solution:
No fields are selected. Define at least one field.

Problem:
A defined filter is not used.
Solution:
The filter may be inactive. Modify the unused filter and make it active.

Problem:
The late binding in the advanced query is ignored, but the execution runs into an error.
Solution:
The late binding macro has a syntactical error. The correct syntax for the late binding macro in the advanced query is as follows:

    latebinding{attribute;operator}
    latebinding{attribute;operator;defaultvalue}

Problem:
The breadcrumb is used to move back, but the login screen is shown.
Solution:
The session timed out. Log in again, or raise the session time in the app.xml.

Problem:
The view configuration is defined, but the default configuration is shown.
Solution:
The defined view configuration is not set to be visible. Open the view configuration and define the view configuration to be used.

Problem:
The view configuration is defined, but the last configuration is shown.
Solution:
The defined view configuration is not set to be visible. Open the view configuration and define the view configuration to be used.

Problem:
The execution of a query takes a very long time.
Solution:
The results coming from the database are very large. Set a limit for the results, or extend the filter conditions.

© 2010, Oracle Corporation and/or its affiliates

docs.oracle.com/cd/E19080-01/n1.grid.eng6/817-6117/i999787/index.html
