
Lessons Learned from Launch Vehicle Avionics Systems

C. A. Ignatious
Deputy Director, VSSC (SR)
VSSC, Thiruvananthapuram
ca_ignatious@vssc.gov.in

Abstract

Failures in avionics systems result from specific errors and mistakes. Studying the root causes of avionics system failures yields good lessons, and the lessons learnt from failures help us build better systems. Inadequate capture of requirements, insufficient margins, differences between test and flight conditions, polarity/sign reversals, incorrect application of components, and disregard of early warning signals are some of the important reasons for failures. Systemic deficiencies such as overconfidence, complacency, unsustainable project schedules, flaws in the review process, and inadequate documentation also add to the causes of failures.

Introduction

The success of launch vehicles and satellites depends on the performance of the avionics system as well. In the worldwide launch scenario, in the decade since 2000, about 30% of launch failures were caused by malfunctioning of avionics systems, including software (Table 1). Also, during the final phase of the launch campaign of many missions, anomalies in the electronics systems have caused anxiety and launch delays. Many good lessons have been learned from design, development, qualification and acceptance testing, final integrated tests and flight experience. A few cases are discussed here.

Lesson 1: Space is unforgiving; thousands of good decisions can be undone by a single engineering flaw or workmanship error, and these errors and flaws can result in catastrophe

It is always the simple stuff that kills you. Failures of launch vehicles are quite different from failures of other systems such as T&M equipment, computers or communication systems. In fact, launch failures are considered accidents rather than reliability failures caused by random failures of components. They are mainly due to design errors or workmanship errors in fabrication. If we look at the history of ISRO launch failures, it can be seen that very simple and silly mistakes were the reasons behind these accidents. Table 2 gives a summary of ISRO launch failures. As stated above, design errors or workmanship-related errors in fabrication are the reasons for these failures. Attending to minute details during design and avoiding deviations and mistakes during fabrication are the key factors that ensure mission success.

Failure reasons                    Percentage
Propulsion                         54%
Guidance and Navigation            4%
Software and computing systems     21%
Electrical systems                 8%
Structures                         0%
Ordnance                           0%
Pneumatics & Hydraulics            0%

Table 1: Worldwide scenario of launch failures

Launch vehicle    Flight nos.     Total no. of failures    Total flights
SLV 3             E1              1                        4
ASLV              D1 & D2         2                        4
PSLV              D1              1                        19
GSLV              F02, D3, F06    3                        7

Table 2: ISRO launch failures

Lesson 2: Robust design is essential to mission success

We all know that the first and foremost factor determining the reliability of a system is good design. The major factors of a good avionics system design are the following:

- All requirements are clearly and unambiguously specified at the outset.
- A very clear specification document is prepared for the interfaces between the systems.
- The right electronic parts are selected and applied correctly, as recommended by the part manufacturer.
- Sufficient de-rating is provided for the parts.
- The design supports testability.
- Good PCB layout with good grounding, guarding of low-level signals against noise, good timing design with adequate margins, and attention to signal integrity issues.
- Good thermal design with adequate margins.
- Good mechanical packaging with respect to shock and vibration.

Lesson 3: System requirements must be adequately captured

One major concern during the system design phase is that the requirements of the system and sub-systems are not adequately defined and detailed. Normally, the scope and major specifications are properly defined; however, very detailed requirements are not written initially, and many new requirements are often discovered during proto model evaluation. Giving special attention to the details of the requirements, and discussing them thoroughly, goes a long way towards a good system. This is all the more important for software-intensive systems and for systems with FPGA devices.

One classic example is the much-publicised Mars Polar Lander failure. The three legs of the lander, which were kept in the stowed position, had to be deployed at about 1500 metres above the ground. Also, the engine had to be shut down within 50 ms of the legs touching the Mars surface. The shock sensors

mounted on the legs were used to sense the touchdown and then shut down the engine. The system designers were aware that, when the stowed legs were deployed, the sensors on the legs would produce a momentary signal similar to that on touchdown. However, the designers failed to implement the requirement that processing of the leg sensor data shall not begin until 12 metres above the ground. As a result, during the landing phase, the engine was shut down when the leg sensors generated the false momentary signal as the legs were deployed at about 1500 metres, and a two-year-long mission was lost.

Lesson 4: Wrong application of avionics parts is a major concern

Incorrect usage, wrong application, inadequate de-rating, and not following the component manufacturer's guidelines are major causes of malfunctions in avionics systems. The problems caused by these reasons are actually far more numerous than the system failures due to component quality and reliability issues. A few case studies are given below:

- Recently, in an avionics package, an overshoot of up to 22 V lasting about 2 to 3 ms was observed on the +15 V supply line when the package was switched ON. Analysis showed that the data sheet of the Interpoint DC-DC converter MHF+ 2815D specifies that the capacitance across its output shall be less than 10 µF, while about 80 µF was placed across the supply line in the package. This overshoot may become catastrophic, as the absolute maximum voltage specification for many devices is 16 V or 18 V.
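The kind of check that would have flagged the supply-line case above can be sketched as follows. This is only a minimal illustration, not a qualified analysis tool: the function name `check_supply_rail` and its parameter names are hypothetical, and the numbers are the ones quoted in the case study (a limit of 10 µF from the converter data sheet, about 80 µF fitted, a 22 V overshoot, and 16 V taken as the lower of the two quoted absolute maximum ratings).

```python
def check_supply_rail(fitted_cap_uf, max_cap_uf, peak_voltage_v, abs_max_voltage_v):
    """Return a list of datasheet-limit violations found on one supply rail.

    All names here are illustrative. The limits themselves must be taken
    from the datasheets of the converter and of every load on the rail.
    """
    violations = []
    # The limit is quoted as "less than 10 uF", so equality is also a violation.
    if fitted_cap_uf >= max_cap_uf:
        violations.append(
            f"output capacitance {fitted_cap_uf} uF is not below the "
            f"converter limit of {max_cap_uf} uF"
        )
    # Any transient above the absolute maximum rating of a load is a violation.
    if peak_voltage_v > abs_max_voltage_v:
        violations.append(
            f"peak voltage {peak_voltage_v} V exceeds the load absolute "
            f"maximum of {abs_max_voltage_v} V"
        )
    return violations


# Values quoted in the case study.
problems = check_supply_rail(
    fitted_cap_uf=80.0, max_cap_uf=10.0,
    peak_voltage_v=22.0, abs_max_voltage_v=16.0,
)
for problem in problems:
    print(problem)
```

With the quoted values both checks fail, which is the point of the lesson: the data was available in the data sheet before the package was ever switched ON.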

- Many times, data corruption was seen in EEPROM devices due to wrong or incorrect data protection schemes employed in the circuits.
- In certain cases, problems with unused pin terminations have resulted in intermittent malfunctioning of circuits.
- Inadequate or incorrect power-on reset circuits have caused problems in many packages.
- Special care in layout design is required for the timing capacitors used with certain devices, as these inputs may be more susceptible to noise.
- Recently, in one of the telemetry packages, excessive spikes in the output data were observed because the RS-485 opto-coupled transceiver was not provided with the bypass capacitors recommended by its manufacturer.
- Disregarding the inverse current gain of transistors in a relay driver circuit resulted in sneak paths in the system, which were detected only after about 15 years of usage.
- Tinning and hand soldering of surface-mount CDR-type ceramic capacitors was a regular practice in many work centres, and this has resulted in package failures even at the launch pad. In fact, chip capacitor manufacturers recommend avoiding hand soldering, mainly because the thermal gradient during soldering causes cracks in the layers of multilayer ceramic capacitors, which may develop into capacitor shorts.
- Not adhering to workmanship practices during the assembly of devices onto PCBs is a major problem that causes many latent failures.
- Many of the new devices are very fast, and extreme care is needed during layout. Line lengths should be kept very short and proper terminations provided for each interconnection to avoid signal integrity issues such as overshoot and ringing. Qualification models passing the tests and flight models later developing problems, especially during thermal tests, is a very common phenomenon nowadays, and is attributable to such signal integrity issues.

The important point is that design engineers should read and understand all the datasheets, application guidelines, precautions, layout guidelines, good workmanship practices, etc., of every device used in the design. Even copying an already proven circuit from an old design may cause problems in a new design unless all of these are properly understood and applied.

Lesson 5: Use FPGA devices in critical applications with extreme care

FPGA devices have been in use in our launch vehicle projects for more than a decade. Initially, low-capacity devices (1 to 8 K gates) were used. Currently, 100 to 200 K gate designs are implemented using FPGA devices. This high gate count and the associated design complexity is one of the major problems in designing with FPGA devices. The design methodology followed for FPGA design is neither the good software engineering practice followed in software design nor the ASIC design methodology. In both of those cases, mature engineering practices exist to avoid mistakes during design and to find and correct errors and bugs before the product is released. Today the FPGA design process is a casual one, as it is felt that design errors can be corrected easily compared with the cost of correction in an ASIC. There is an overall increase in the usage of FPGA designs for space applications. However, the design methodology has not improved significantly, and at the same time the risk involved is not fully appreciated. It is important to follow a good design methodology, as envisaged in the DO-254 standard for complex electronics, with documents such as a requirements specification and a design document, third-party independent verification and validation, and a very detailed review process.

Some of the very important lessons learned from the usage of FPGA designs for onboard applications are given below:

- The power-ON behaviour of FPGA devices has to be studied very carefully.
- Initialising all the flip-flops in an FPGA device is good, but may not be necessary, as it may increase the usage of routing resources. However, all the flip-flops driving the outputs shall be initialised during power ON.
- Using an external Schmitt trigger inverter or buffer is recommended to route the reset signal to the input of the FPGA.
- Asynchronous assertion and synchronous de-assertion of the reset makes the FPGA design more robust and able to tolerate start-up delays of oscillators and other internal delays.
- Internal clock buffers are to be used to route the reset signal as well as the clock inside the FPGA. As the available clock buffers are finite, limit the number of clocks used inside the FPGA.
- Avoid derived clocks and gated clocks, as these force the designers to use normal lines other than clock buffers for driving the clock inputs of flip-flops. This increases clock skew and reduces the testability of the circuits.
- Metastability issues may develop when signals are transferred between different clock domains. Necessary precautions have to be taken to avoid this.
- Unused inputs such as test and mode pins have to be properly terminated.
- During synthesis, the SAFE option shall be enabled to ensure recovery from illegal states.
- The design documents, VHDL code, test benches, synthesis tool configuration, fuse maps, etc. have to be version and configuration controlled.
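The recommendation of asynchronous assertion and synchronous de-assertion of the reset can be illustrated with a small behavioural model. The sketch below is written in Python purely for illustration and is not HDL for any particular device. The class name `ResetSynchronizer`, the active-low convention and the two-stage depth are assumptions made for this example.

```python
class ResetSynchronizer:
    """Behavioural model of a reset that asserts asynchronously and
    de-asserts synchronously.

    Assumed convention (illustrative): active-low signals, so 0 means
    "in reset" and 1 means "released".
    """

    def __init__(self):
        self.rst_n = 0    # external reset pin, held low at power ON
        self.stage1 = 0   # first flip-flop of the chain
        self.stage2 = 0   # second flip-flop, drives the internal reset

    def set_external_reset(self, rst_n):
        """Apply a level to the external pin. Assertion needs no clock."""
        self.rst_n = rst_n
        if rst_n == 0:
            # Asynchronous assertion: both stages clear immediately.
            self.stage1 = 0
            self.stage2 = 0

    def clock_edge(self):
        """Model one active clock edge."""
        if self.rst_n == 1:
            # Synchronous de-assertion: the release ripples through the
            # two stages, one stage per clock edge.
            self.stage2 = self.stage1
            self.stage1 = 1

    @property
    def internal_rst_n(self):
        return self.stage2


sync = ResetSynchronizer()
sync.set_external_reset(1)          # pin released before any clock edge arrives
trace = [sync.internal_rst_n]
for _ in range(3):
    sync.clock_edge()
    trace.append(sync.internal_rst_n)
print(trace)                        # [0, 0, 1, 1]

sync.set_external_reset(0)          # assertion takes effect without a clock
print(sync.internal_rst_n)          # 0
```

In this model the internal reset stays asserted until clock edges actually arrive, which is why such a scheme tolerates a slow-starting oscillator, while assertion never depends on the clock being present.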

The important lesson is that FPGA designers should be experts in the area, with detailed knowledge of all the intricacies of the devices, design methodologies, and verification and validation practices, and above all should be experts in the design and verification tools. All designs have to be thoroughly reviewed by a team of experts in both FPGA design and system design.

Lesson 6: Experience with industrial grade Plastic Encapsulated Microcircuit (PEM) devices is really good

It is true that more, and more important, lessons are learned from failures. However, we can learn lessons from successes also. There was a myth that only mil grade / space grade devices are suitable for launch vehicle applications. However, the bold decision to use industrial grade PEM devices in launch vehicles, especially in less critical telemetry applications, has paid dividends. The major benefits of using industrial grade devices are availability, lower cost, increased functionality of the components, etc. The failure rate and reliability levels are comparable to those of mil devices. The important lessons learned from the usage of PEMs are as below:

- Select components from reputed manufacturers only.
- Employ the components only after proper evaluation and qualification tests to establish margins.
- Provide protection against moisture during storage.

Currently about 60 to 70% of the total number of semiconductor devices used in onboard applications are industrial grade semiconductor parts.

Lesson 7: Take care of interfaces

One major problem in large systems, such as launch vehicles and satellites, is defining and implementing the proper electrical and mechanical interfaces between systems. Maintaining proper interfaces between the different working teams is also very important and demanding. In fact, there was a major failure because one team provided data in FPS units and the other team interpreted the numbers in MKS units. It is important to have a well-defined interface document in which all interface specifications are stated without ambiguity. This document has to be reviewed and approved by all the teams working on the project.
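One way to make an interface unambiguous in software is to carry the unit together with the number, so that a value cannot be consumed without an explicit conversion. The sketch below is a minimal illustration under assumptions: the `Force` type, its field names and the choice of force as the exchanged quantity are hypothetical and are not taken from any actual interface document.

```python
from dataclasses import dataclass

# 1 lbf = 0.45359237 kg x 9.80665 m/s^2 = 4.4482216152605 N
LBF_TO_N = 4.4482216152605

# Units accepted at this (hypothetical) interface and their factor to newtons.
_TO_NEWTON = {"N": 1.0, "lbf": LBF_TO_N}


@dataclass(frozen=True)
class Force:
    """A force value that always travels with its unit."""
    value: float
    unit: str

    def to_newtons(self):
        # Reject anything the interface does not define instead of guessing.
        if self.unit not in _TO_NEWTON:
            raise ValueError(f"unknown force unit: {self.unit!r}")
        return self.value * _TO_NEWTON[self.unit]


# The producing team works in FPS units; the consuming team needs MKS units.
reported = Force(100.0, "lbf")
print(round(reported.to_newtons(), 2))   # 444.82
```

Reading `reported.value` directly as newtons would silently understate the force by a factor of about 4.45, which is the class of error described above. Forcing every consumer through `to_newtons()` turns that silent error into either a correct conversion or an explicit exception.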

Lesson 8: Demonstrate design margins

Demonstrating design margins is as important as robust design for ensuring mission success. This is to be done in the early phase of the project itself. Design margins are to be compliant with environment, interfaces, tolerances and uncertainties. It is important to determine the margins analytically before testing the system. A proper derating analysis of every component in the system with regard to voltage, current, power and thermal characteristics gives good assurance about electrical stresses. Vibration analysis of the chassis or packaging, and of the mounting details of the unit on the launch vehicle, has to be done to ascertain the margins available. After ensuring through design analysis that sufficient margins exist, the actual systems have to be put through the required tests.

Lesson 9: There is no alternative to testing

A product or subsystem is designed and developed to perform defined functions while meeting the requirements, specifications, interface definitions, environmental conditions, etc. Testing is the only process that validates that all of these are satisfactorily met. There are many good lessons learnt with regard to testing:

- The system has to be tested for everything it should do and everything it should not do.
- The tests are to be representative of flight conditions: test as you fly and fly as you test. A frequent cause of maiden flight failures is that the ground tests do not truly represent the flight conditions. It may not always be possible to meet the above guidelines. In such cases, a systematic analysis of the differences between test and flight conditions has to be carried out to understand the limitations of the ground test, assess the risks involved and find ways to mitigate them.
- Use real flight systems instead of simulations wherever feasible, as simulations may sometimes miss important points.
- Assumptions used in tests and simulations are to be fully understood.

Lesson 10: Test-induced failures are also a concern

While it is essential to test every subsystem as described above, it is also extremely important to ensure that the testing is done very carefully. Many avionics systems have failed during testing due to operational errors, faulty test equipment, wrong test conditions, improper power-up sequences, etc. Some lessons learned are:

- All tests are to be done based on a test plan that is reviewed and approved by experts from the design and test agencies.
- The test equipment or checkout systems should have the necessary safety interlocks to prevent accidental damage to the test article.
- Never perform a new test for the first time on the flight system. It is first to be performed on a ground model before being done on the flight unit.
- Some common problems are: excess neutral-to-earth voltage, isolation degradation between onboard and ground systems, improper over-voltage/over-current settings, air conditioning not working, thermal chambers without humidity control, inadequate ESD control, and poor training, inexperience or fatigue of the operators.

It is to be noted that realising flight hardware that meets all the quality norms is not an easy task. Test-induced failure is a reality, and hence extreme care is to be exercised while testing flight systems.

Lesson 11: Early warning signals are to be treated as very serious

Ignoring early warning signals has been a major factor in launch failures. Every minor deviation observed during tests and previous flights has to be thoroughly analysed to understand the reasons behind it, even if the deviation is acceptable from the mission point of view. Similarly, certain observations in a test may not repeat when retested. Such one-time observations shall not be treated as insignificant either, as they may be real warning signs and may reappear at the most difficult moments.

In space shuttle history, the Columbia disaster was triggered by polyurethane foam hitting the wing of the shuttle. This was seen at the time of take-off. However, the NASA team did not consider it significant enough to rectify during the flight, which resulted in disastrous consequences. Probably one of our previous flights would have been a success if we had given enough importance to the demating of a connector carrying T/M signals during the flight before it.

Lesson 12: Avoid changes to a qualified system

Qualified and proven flight systems, both hardware and

software, shall never be changed unless there are compelling reasons. In the history of launch vehicles and satellites, last-minute changes, especially those made in the heat of the countdown, have resulted in several failures. It is very difficult to thoroughly verify and validate last-minute changes and their side effects. Changes are often said to be for improvement, but most of the time the intended improvements come with side effects that are detrimental. If changes are to be made to an already qualified system, there should be a very detailed justification and a comprehensive review before the changes are implemented. If possible, the system has to be requalified, at least for the changes.

Lesson 13: Sign errors involving orientation and phasing (polarity) are very serious concerns

Many failures in launch vehicles as well as in satellites have resulted from reversing the sign or polarity of signals. This is especially important in control systems, where a polarity or sign reversal results in positive feedback instead of negative feedback, which increases the error instead of correcting or reducing it, with disastrous consequences. The sign reversal error can happen in the interconnections, especially between subsystems, and inside the avionics packages. This error can also happen in the software, in the data that are sent to the actuation elements. We also had a near miss in one of our flights due to sign reversal, in spite of the very elaborate sign checks done to ensure the correct polarity of signals.

Lesson 14: Strict adherence to the qualified process

Every flight system, whether hardware or software, is to be realised through a qualified and documented process. Many times, failures or large rejections are the result of process deviations or unauthorised changes from the qualified process. As stated earlier for the qualified system, a qualified and approved process should not be changed unless there are compelling requirements. Also, the product should be realised in strict adherence to the process. A large number of failures in avionics packages were observed due to defective PCBs from a qualified manufacturer, the reason being that the manufacturer did not meticulously follow the documented process. In another, earlier case, an HMC manufacturer used a fresh batch of isopropyl alcohol for cleaning without evaluating and ascertaining its contamination level, which caused ionic contamination of the HMC devices and resulted in high leakage currents. The important lesson is that there should be a reviewed and approved process document, and all flight systems have to be realised strictly according to this process.

Lesson 15: Identification of systemic factors and appropriate actions to remedy their effects are very important

Failures will not repeat in the same way. The solution to a particular problem will prevent the same failure from repeating in future, but it will not prevent future failures unless the systemic

deficiencies which caused the problem are addressed. Identifying the systemic deficiencies and taking action to rectify them are very important. Some of the very common systemic deficiencies are given below:

a) Complacency and overconfidence

It is really ironic that success itself becomes the reason for failures. A number of continuous flight successes results in a sense of overconfidence and complacency. Even when deviations are seen, past flight successes are quoted as examples to discount the evidence of risk. Risks are tolerated because they have been experienced before, and no attempt is made to correct or eliminate them. Overconfidence leads to cutting corners, relaxing safety margins or making trade-offs that increase risk.

b) Driving to unsustainable schedules

Schedule pressure is a common theme in many of the devastating failures. Many of the Soviet space accidents were the result of schedule pressure during the space race between the US and the USSR. Our experience also gives ample evidence that many failures of avionics systems, and subassembly-level failures, are caused by waning alertness and caution in an atmosphere of eagerness to complete the task and meet the deadline. Fatigue due to increased working hours is another reason for accidents. Schedule pressure also increases risk when too many corners are cut in applying proven engineering practices and in the checks and balances necessary for mission success. Project management teams appear primarily focussed on schedule objectives and do not adequately

focus on quality assurance concerns.

Working under schedule pressure is always a necessity in result-oriented organisations, and it is essential to understand and be aware of the risks involved. It is in such situations that conscious decisions are to be taken by the working teams and QA personnel NOT to deviate from the approved processes. Schedules are of course important; quality is more important. The team leader should train the team members not to panic if deadlines are missed because of sticking to quality practices.

c) Over-reliance on redundancy

Redundancy is employed to improve reliability against component failures, and it is most effective against random failures. It is important to note that redundancy is not effective against design errors or inadequate requirement specification. Redundancy is useful only if a failure of the prime system does not affect the redundant system and the FDI scheme detects all the faults. One common drawback is that, in spite of redundancy, there will be many single-point failures due to deficiencies in the FDI scheme. The redundancy management and FDI logic increase complexity, which may actually reduce reliability. Redundancy is not a solution to all problems, and its strengths and weaknesses have to be thoroughly understood.

d) Flaws in review processes

A thorough review by experts is a very good exercise for bringing out deficiencies in hardware and software systems. In fact, one of the strengths of ISRO is its excellent review mechanism at all levels. However, its effectiveness is lost when:

- Participation of experts and members becomes poor.
- Preparation and documentation become insufficient.
- Review meetings are called at short notice.
- The review becomes only a ritual.

e) Inadequate usage of good QA tools

FMEA and FMECA studies are very effective in finding system weaknesses and single-point failures, and thus help in improving system safety and reliability. They are effective only if the studies are done, and the results reviewed, during the initial design phase of the system. Once the systems are realised and qualified, any modifications suggested by FMECA studies are very difficult to implement.

Summary

Successes and failures both teach good lessons. However, failures give us more important lessons. The failures in the case of launch vehicles are not the classical reliability-related failures. They are due more to design errors, workmanship-related fabrication errors, mistakes in the usage of electronic parts, documentation mistakes, etc. A robust design with sufficient margins is a must for mission success. Testing the

system is essential to verify and validate all the assumptions made. "Test as you fly and fly as you test" should be the guiding principle while testing the systems. Do not ignore any deviations, and attend to early warning signals to ensure a successful flight. There are certain systemic deficiencies, such as overconfidence, complacency, flaws in the review process and inadequate documentation, which are to be addressed in order to achieve mission success.

