
DESIGNER FORUM
2010
March 24 to 26, 2010
Ipojuca | Porto de Galinhas Beach
BRAZIL

PROCEEDINGS

Edval J. P. Santos
Cristiano C. Araújo
Valentin O. Roda
Editors

UFPE
www.fade.org.br
TABLE OF CONTENTS
AVR-compatible microcontroller with debug interface and WISHBONE bus .......... 1
Tropea S. E. and Caruso D. M.
Instituto Nacional de Tecnología Industrial, Argentina

Academic experience on introducing the HDL-based design methodology in an electronics engineering program .......... 7
Martínez R., Corti R., D'Agostino E., Belmonte J. and Giandomenico E.
Universidad Nacional de Rosario, Argentina

Audio over Ethernet: implementation using an FPGA .......... 13
Mosquera J., Stoliar A., Pedre S., Sacco M. and Borensztejn P.
Universidad de Buenos Aires, Argentina

Use of self-checking logic to minimize the effects of single event transients in space applications .......... 19
Ortega-Ruiz J. and Boemo E.
Universidad Autónoma de Madrid, Spain

Wireless Internet configurable network module .......... 25
Schiavon M. I., Crepaldo D. A. and Martin R. L.
Laboratorio de Microelectrónica, FCEIA, UNR, Argentina

MIC - A new compression method of instructions in hardware for embedded systems .......... 29
Dias W. R. A., Barreto R. da S. and Moreno E. D.
Department of Computer Science, Federal University of Amazonas, Brazil
Department of Computer Science, Federal University of Sergipe, Brazil

Embedded system that simulates ECG waveforms .......... 35
De Farias T. M. T. and De Lima J. A. G.
Universidade Federal da Paraíba, Brazil

An FPGA based converter from fixed point to logarithmic number system for real time applications .......... 39
De Maria E. A. A., Maidana C. E. and Szklanny F. I.
Universidad Nacional de La Matanza, Argentina

Hardware co-processing unit for real-time scheduling analysis .......... 43
Urriza J., Cayssials R. and Ferro E.
Universidad Nacional del Sur, Argentina

Hardware implementation of the Minkowski method for fractal dimension computation .......... 47
Maximiliam Luppe
Universidade de São Paulo, Brazil

An entry level platform for teaching high performance reconfigurable computing .......... 53
Viana P., Soares D. and Torquato L.
Federal University of Alagoas, Brazil

Derivation of PBKDF2 keys using FPGA .......... 57
Pedre S., Stoliar A. and Borensztejn P.
Universidad de Buenos Aires, Argentina

Automatic synthesis of synchronous controllers with low activity of the clock .......... 63
Del Rios J., Oliveira D. L. and Romano L.
Instituto Tecnológico de Aeronáutica, Brazil
Centro Universitário da FEI, Brazil

Memory hierarchy tuning for energy consumption reduction based on particle swarm optimization (PSO) .......... 69
Cordeiro F. R., Caraciolo M. P., Ferreira L. P. and Silva-Filho A. G.
Federal University of Pernambuco, Brazil

IP core of a reconfigurable cache memory .......... 75
Gazineu G. M., Silva-Filho A. G., Prado R. G., Carvalho G. R., Araujo A. H. C. B. S. and De Lima M. E.
Universidade Federal de Pernambuco, Brazil

A note on modeling pulsed sequential circuits with VHDL .......... 81
Mesquita Junior A. C.
Universidade Federal de Pernambuco, Brazil

Comparative study between the implementations of digital waveforms free of third harmonic on FPGA and microcontroller .......... 85
Freitas D. R. R. and Santos E. J. P.
Universidade Federal de Pernambuco, Brazil
AVR-COMPATIBLE MICROCONTROLLER WITH DEBUG INTERFACE AND WISHBONE BUS

S. E. Tropea and D. M. Caruso
Electrónica e Informática
Instituto Nacional de Tecnología Industrial
Buenos Aires, Argentina
email: salvador@inti.gob.ar
ABSTRACT
In this work we present a microcontroller compatible with Atmel's AVR line. The implementation can be configured to be compatible with second- (e.g. ATtiny22), third- (e.g. ATmega103) and fourth-generation (e.g. ATmega8) AVRs.
The development includes the following compatible peripherals: interrupt controller, input and output ports, timers and counters, UART and watchdog.
To adapt it to different needs, an expansion interface based on the WISHBONE interconnection standard was included.
To ease application development on this platform, the core was equipped with a debug unit, and the necessary software was adapted to allow high-level debugging with a simple and intuitive user interface.
The design was verified using simulators and Xilinx FPGAs (Spartan II and 3A).
1. INTRODUCTION
Our laboratory frequently develops applications based on embedded microcontrollers. For this reason, when we started working with FPGA technology we wanted to implement microcontrollers compatible with the ones used in those developments, so that team members not devoted to this new technology could still take part in FPGA-based projects.
With this goal we first developed a microcontroller compatible with the PIC 16C84 [1], and later added a debug interface for it [2]. This microcontroller proved useful and was transferred to the aerospace industry [3]. However, its architecture is not well suited to programming in the C language. In recent years our laboratory moved to Atmel's AVR line, which was indeed designed to be programmed in C, so we decided to develop an equivalent for FPGAs.
In this work we present the characteristics of the microcontroller we developed and of a debug interface that allows debugging programs written in C from a PC, using simple and intuitive software.
2. MICROCONTROLLER
2.1. Architecture
The AVR is an 8-bit RISC microcontroller with two completely independent memory spaces: program memory and data memory.
The program memory holds the code to be executed. It is 16 bits wide and most instructions have this size; some instructions need two memory positions (32 bits).
The data memory is 8 bits wide and is divided into three sections. There are specific instructions to access each of these sections, but there are also instructions that can access the whole memory space indistinctly. The lowest part of this memory holds 32 8-bit registers, six of which can be grouped in pairs to form three 16-bit registers, usually used as pointers. Next comes the input/output space, with a total of 64 8-bit positions. The rest of the memory is RAM.
Starting with the second generation, the AVR line implements a stack pointer. It is decremented every time something is pushed onto the stack and is usually initialized with the highest RAM address. This resource eases the compilation of applications written in C.
Most instructions execute in one clock cycle, but some can take up to four cycles, as in the case of CALL.
The stack pointer and the status register are mapped into the input/output memory space.
The ALU implements the basic operations: addition, subtraction and shift. Starting with the fourth generation, signed and unsigned integer multiplication and fixed-point (1.7 format) multiplication are introduced.
2.2. Implementation
The development tools used were those recommended by the FPGALibre project [4] [5]. The work was done on workstations running Debian [6] GNU [7]/Linux, and the hardware description language used was VHDL.
To simplify the task and start from a functional code base, we decided to base the development on the AVR Core project [8] from OpenCores.org. First the code was adapted to the FPGALibre project guidelines. The original VHDL code is written at a very low level, without exploiting the expressiveness of the language, so we decided to rewrite it to obtain more compact and maintainable code. Modules that were kept separate, such as the ALU and the bit-oriented operations processor, were also unified.
The original project only implements the third AVR generation; in particular it tries to model the ATmega103. To achieve greater flexibility we decided to make the CPU parameterizable, allowing the selection of one of three possible generations. In this way the FPGA area can be better exploited by choosing among three different versions according to the complexity of the project at hand.
The main differences between the second and third generations are the size of the stack pointer (8 bits in the second, 16 in the later ones) and the lack of the absolute jump instructions (JMP and CALL). The fourth generation adds multiplication, moves between register pairs (16 bits) and an improved LPM instruction (program memory access).
2.3. Peripherals
In order to ease the use of existing applications for the AVR line, we decided to implement some of the most common peripherals.
Interrupt controller: it allows configuring, masking, etc., the external interrupts. Two different implementations were made, one compatible with the modern lines (e.g. ATtiny22 and ATmega8) and another compatible with the older ones (e.g. ATmega103).
Input and output ports: they allow configuring pins as inputs or outputs. A new, flexible implementation able to model all possible cases was developed.
Timers and counters: the implementation from the original project was adapted. It implements Timers 0 and 2 of the ATmega103. The code was modified to allow greater reuse, since both timers are very similar. These timers are 8 bits wide and can be used as counters or as PWM generators.
USART: the implementation from the original project was adapted. It is very flexible and allows selecting different transmission rates and data lengths.
Watchdog: this peripheral was not present in the original implementation, and since it is related to a CPU instruction we decided to implement it.
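To make the timer/PWM idea concrete, the following minimal VHDL sketch shows an 8-bit free-running counter compared against a duty-cycle register to produce a PWM output. It is only an illustration with hypothetical names, not the peripheral implemented in this work.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  -- Minimal 8-bit timer/PWM sketch (hypothetical names, not the
  -- implementation described in this work): a free-running counter
  -- compared against a threshold register generates the PWM output.
  entity timer8_pwm is
     port(
        clk_i : in  std_logic;
        rst_i : in  std_logic;
        ocr_i : in  unsigned(7 downto 0); -- compare (duty cycle) value
        cnt_o : out unsigned(7 downto 0); -- current count, readable by the CPU
        pwm_o : out std_logic);
  end entity timer8_pwm;

  architecture rtl of timer8_pwm is
     signal cnt : unsigned(7 downto 0):=(others => '0');
  begin
     process (clk_i)
     begin
        if rising_edge(clk_i) then
           if rst_i='1' then
              cnt <= (others => '0');
           else
              cnt <= cnt+1; -- wraps at 255, free-running counter mode
           end if;
        end if;
     end process;
     cnt_o <= cnt;
     pwm_o <= '1' when cnt < ocr_i else '0';
  end architecture rtl;

Used as a counter the compare output is simply ignored and the count value is read back; used as a PWM generator the duty cycle is set through ocr_i.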
2.4. Expansion Bus
In a microcontroller the available peripherals are fixed, but in an FPGA implementation it is desirable that they can be easily added and/or removed. For this reason we decided to implement an expansion bus.
Following the same criteria adopted in the past [1], the WISHBONE interconnection standard [9] was selected. It has the following advantages:
- For simple cases (one master and one or more slaves) it reduces to little or no additional logic.
- It was designed with more complex cases in mind (more than one master, retry, error notification, etc.).
- It is royalty-free and can be used at no cost.
- The complete specification is available on the internet.
To access the WISHBONE bus, two registers were implemented in the input/output space. Since our implementation left out the EEPROM of the AVRs, addresses 0x1C to 0x1F were available; we decided to use addresses 0x1E and 0x1F. Register 0x1F is used to indicate the address of the peripheral we want to access on the WISHBONE bus; afterwards, any operation on register 0x1E is transferred through the WISHBONE bus. This allows access to up to 256 8-bit registers on the WISHBONE bus.
The WISHBONE bus supports the connection of slow peripherals: if a peripheral needs more than one clock cycle to complete the operation, it can stall the microcontroller until the operation has finished.
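As a sketch of how a peripheral can hang off this expansion bus, the following hypothetical WISHBONE slave implements a single 8-bit register and deliberately answers with a one-cycle-delayed ACK, which is how a slow peripheral stalls the CPU as described above. The module is illustrative only and is not part of the core described here.

  library ieee;
  use ieee.std_logic_1164.all;

  -- Illustrative WISHBONE slave (hypothetical, not from this work):
  -- one 8-bit register, answering with ACK delayed by one cycle so
  -- the master (the CPU core) is stalled for that extra cycle.
  entity wb_reg_slave is
     port(
        wb_clk_i : in  std_logic;
        wb_rst_i : in  std_logic;
        wb_stb_i : in  std_logic;
        wb_we_i  : in  std_logic;
        wb_dat_i : in  std_logic_vector(7 downto 0);
        wb_dat_o : out std_logic_vector(7 downto 0);
        wb_ack_o : out std_logic);
  end entity wb_reg_slave;

  architecture rtl of wb_reg_slave is
     signal reg : std_logic_vector(7 downto 0):=(others => '0');
     signal ack : std_logic:='0';
  begin
     process (wb_clk_i)
     begin
        if rising_edge(wb_clk_i) then
           if wb_rst_i='1' then
              ack <= '0';
              reg <= (others => '0');
           else
              -- ACK one cycle after STB is seen: one wait state
              ack <= wb_stb_i and not ack;
              if wb_stb_i='1' and wb_we_i='1' and ack='1' then
                 reg <= wb_dat_i; -- the write completes with the ACK
              end if;
           end if;
        end if;
     end process;
     wb_dat_o <= reg;
     wb_ack_o <= ack;
  end architecture rtl;

From the AVR program's point of view, such a register would be reached by first writing its WISHBONE address to I/O register 0x1F and then reading or writing I/O register 0x1E, as explained above.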
2.5. Equivalent Configurations
In order to provide the end user with configurations similar to existing microcontrollers, three microcontrollers were implemented, one for each supported generation:
- ATtiny22: second generation; a single 5-bit input/output port, one 8-bit timer, WISHBONE bus, 8-bit stack pointer, watchdog, one external interrupt source, 128 bytes of RAM, 1024 words of program memory.
- ATmega103: third generation; 6 input/output ports, UART, WISHBONE bus, two 8-bit timers, 16-bit stack pointer, watchdog, eight external interrupt sources, 4096 bytes of RAM, up to 65536 words of program memory.
- ATmega8: fourth generation; 3 input/output ports, UART, WISHBONE bus, two 8-bit timers, 16-bit stack pointer, watchdog, two external interrupt sources, 1024 bytes of RAM, up to 65536 words of program memory.
All the configurations are parameterizable, so unused peripherals can be removed.
3. DEVELOPMENT TOOLS
Since this implementation includes the complete instruction set of the original processor, and since three configurations equivalent to existing microcontrollers were implemented, most of the tools available for the AVR line can be used.
Very good free software tools are available for the AVR line that can be used with our microcontroller.
3.1. Assembler
To assemble sources written for the Atmel assembler (avrasm) it is possible to use avra [10]. It is distributed under the GPL license and can be compiled for the most popular platforms. Besides being compatible with the Atmel assembler, it offers improved macro support and conditional assembly.
3.2. C/C++ Compiler
Even though this is an 8-bit platform, a version of the GNU project's C compiler [11] capable of generating code for the AVR is available. This version of gcc is known as gcc-avr and can generate highly optimized code for second- to fifth-generation AVRs.
3.3. Standard C Library
A very complete implementation of the standard C library, specially designed for the AVRs, is available from the avr-libc project [12]. This implementation is specially optimized for the AVRs and allows writing flexible and compact applications.
One of the interesting details of this library is that the standard input and output (stdin and stdout) can be redirected to any device, for example the serial port.
3.4. Debugger
The GNU project debugger, gdb [13], can be compiled with AVR support. This version of gdb is known as avr-gdb and is able to debug programs written for the AVRs. Since gdb is a huge application, it runs on a PC, communicating with an AVR simulator or with the microcontroller itself.
GDB is a command-line application, but several user interfaces are available to make its use simpler.
3.5. Simulator
A simulator capable of simulating the behavior of an AVR is simulavr [14]. It runs on a PC and can be used from avr-gdb to debug the simulated code without the need for a real AVR.
4. DEBUG INTERFACE
4.1. Introduction
The development of electronic products based on embedded systems poses a challenge when it comes to removing implementation errors. This is because the microcontrollers used for these tasks are usually small and cannot execute the complex tasks involved in debugging. The task becomes even harder when these devices lack a friendly user interface, with no video output to connect a monitor and no keyboard or similar input to enter data.
To solve these problems, functionality is usually included in the microcontroller that allows the debugging tasks to be carried out remotely from a personal computer, where the aforementioned resources are available.
4.2. Architecture Selection
Modern devices of the AVR line provide debug facilities through a JTAG (Joint Test Action Group) port. One possible solution to this problem would have been to implement a compatible interface. The advantage of this solution is that any already available software could have been used without modifications. The drawback is that PCs do not have a JTAG interface, which implies a special cable. This is not the only problem: in practice what the programs support is not direct JTAG handling but a special protocol, normally implemented with a microcontroller. Thus, these cables actually contain a microcontroller which is the one that really communicates over JTAG with the microcontroller we want to debug. One possible solution was to implement this second CPU in the same FPGA.
On the other hand, this would have required implementing a debug unit compatible with the AVR one and accepting its limitations. Our team already had experience in the development of a debug unit [2], more flexible than the AVR one, so we decided to adapt our debug unit and avoid the need for a second microcontroller. The drawback of this approach is that the software must be adapted to work with our microcontroller.
4.3. Communication with the PC
Our original debug unit is a peripheral that supports the WISHBONE interconnection standard. To control this kind of peripheral it is necessary to access the WISHBONE bus, which is inside the FPGA. One way to achieve this access is by using some kind of bridge. In our previous work we used a parallel port (EPP mode) to WISHBONE bridge [15], developed by our team. In this case, and because the parallel port has been almost completely replaced by USB, we chose to use a USB to WISHBONE bridge [16] [17], also developed by our team.
4.4. Features
Our debug unit allows:
- Stopping/resuming the execution of the microcontroller at any time.
- Executing the program step by step.
- Stopping execution when a given program memory position is reached (breakpoint). The number of breakpoints is configurable between 1 and 256.
- Resetting the microcontroller.
- Accessing all the registers, including the program counter.
- Accessing the input/output space.
- Inspecting the calling stack.
- Stopping execution when a data memory position is accessed (watchpoint). The accesses can be selected to stop on read, write or both. The number of watchpoints is configurable between 1 and 256.
- Modifying the program memory.
- Detecting stack overflows and stopping execution when they happen.
4.5. Complementary software
To debug programs running on AVR microcontrollers it is possible to use avr-gdb, but this program needs to communicate with another program which is the one that actually controls the microcontroller. A widely used program is AVaRICE [18]; being free software, it was possible to modify it. Support was added to AVaRICE to control our debug unit through the USB port.
As the user interface to control avr-gdb, the SETEdit program [19] was selected. It is the working environment recommended by the FPGALibre project. SETEdit implements the GDB/MI protocol for debugging C/C++ and assembler programs, so no major changes were needed to adapt it to the debugging of programs written in C or assembler running on this microcontroller.
4.6. Complementary hardware
The microcontroller presented here does not implement the self-programming functionality included in the original. This is not a significant problem, since FPGAs are reconfigurable and it is therefore enough to re-synthesize the design to modify the program executed by the microcontroller. However, remote debuggers commonly allow modifying the program executed by the embedded system, so to complement this development a WISHBONE peripheral capable of accessing the program memory of the microcontroller was designed. This made it possible to reload that memory without reconfiguring the whole FPGA.
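A minimal sketch of the idea follows (hypothetical names, not the actual peripheral): the program memory is inferred as a dual-port RAM, with one port dedicated to CPU fetches and the other written from the expansion-bus side, so the debugger can load a new program at run time.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  -- Sketch of a dual-port program memory: port A is the CPU fetch
  -- port, port B is written from a WISHBONE-style peripheral so the
  -- program can be replaced without resynthesizing the FPGA.
  -- Hypothetical names, for illustration only.
  entity prog_mem_dp is
     generic(ADDR_W : natural:=10); -- 1024x16, as in the examples below
     port(
        clk_i     : in  std_logic;
        -- CPU fetch port
        pc_i      : in  unsigned(ADDR_W-1 downto 0);
        inst_o    : out std_logic_vector(15 downto 0);
        -- write port driven by the debug/WISHBONE peripheral
        wr_addr_i : in  unsigned(ADDR_W-1 downto 0);
        wr_data_i : in  std_logic_vector(15 downto 0);
        wr_en_i   : in  std_logic);
  end entity prog_mem_dp;

  architecture rtl of prog_mem_dp is
     type mem_t is array(0 to 2**ADDR_W-1) of std_logic_vector(15 downto 0);
     signal mem : mem_t;
  begin
     process (clk_i) -- synchronous read and write map to a block RAM
     begin
        if rising_edge(clk_i) then
           inst_o <= mem(to_integer(pc_i));
           if wr_en_i='1' then
              mem(to_integer(wr_addr_i)) <= wr_data_i;
           end if;
        end if;
     end process;
  end architecture rtl;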
Fig. 1 illustrates the interconnection of the hardware components mentioned above; the blocks drawn with a solid fill correspond to the developments described in this work. Fig. 2 shows the data flow inside the computer: the data enter through the USB port, are taken by the operating system (Linux) using basic input/output operations and sent to user space through the libusb library; these data are processed by AVaRICE and translated into the gdb remote protocol, and finally the debugger (avr-gdb) sends them to the user interface (SETEdit) using the GDB/MI protocol.
5. RESULTS
Fig. 3 shows a debugging session using SETEdit. It shows the source code of the program, the disassembled code and a window used to watch the value of a variable.
This development was verified using Xilinx Spartan II and Spartan 3A FPGAs. The XST 10.1.02 K.37 tool was used for synthesis.

Fig. 1. Interconnection diagram of the hardware blocks.
Fig. 2. Data flow inside the computer.

The occupied area depends on several parameters; some examples are described below.
- ATmega103 configuration with a 1024x16 program memory, no internal peripherals and only a small UART on the WISHBONE bus: 644 slices (245 FFs and 1124 LUTs), 3 BRAMs.
- ATmega8 configuration with a 1024x16 program memory, no internal peripherals and only a small UART on the WISHBONE bus: 707 slices (275 FFs and 1224 LUTs), 2 BRAMs and one multiplier.
- ATtiny22 configuration with a 1024x16 program memory, no internal peripherals and only a small UART on the WISHBONE bus: 606 slices (237 FFs and 1053 LUTs), 2 BRAMs.
- ATmega8 configuration with a 1024x16 program memory, port B enabled, a small UART on the WISHBONE bus, debug interface with 3 breakpoints and 3 watchpoints (USB): 1548 slices (902 FFs and 2477 LUTs), 4 BRAMs and one multiplier. Approximately 500 of these slices are needed to implement the USB to WISHBONE bridge.
Only in the last case was the tool asked to meet a target working frequency of 24 MHz for the whole circuit, except for the USB PHY, which had to run at 48 MHz. In the other cases no particular requirement was imposed and the tool reported working frequencies between 30 and 37 MHz for a speed grade 4 Spartan 3A.
5.1. DC Motor Control
In order to verify the correct operation of the microcontroller and the capabilities of the debug unit, the development of a position controller for DC motors was undertaken. A PID controller was implemented for this purpose. This development uses the following peripherals connected to the WISHBONE bus:
- A PWM modulator with 15 bits of resolution.
- A decoder for a relative (incremental) encoder with 16 bits of resolution.
- A small UART working at 115200 baud.
The configuration used is compatible with the ATmega8, with a 4096x16 program memory and the debug unit enabled. Of the internal peripherals, only port B is enabled. This configuration took 1732 slices (1009 FFs and 2786 LUTs), 7 BRAMs and one multiplier (Spartan 3A).
Having already obtained the first successful results, the mechanism for determining the PID constants remains to be refined.
6. CONCLUSIONS
The choice of the AVR architecture made a wide range of tools and libraries available. At the same time, it was verified that programming this device is as easy as programming a commercial AVR.
The resulting debug unit is powerful, able to perform most of the operations performed by debuggers used on personal computers, and it is of great help when looking for errors in systems running in real time.
The use of FPGAs has the advantage that the debug unit can be removed in the final version of the design, so it does not consume resources in the final device. If the design occupies practically the whole FPGA, it is enough to use a larger FPGA during the development stage in order to include debug units such as this one.
The selection of the WISHBONE interconnection standard allowed the reuse of a USB bridge and opens the possibility of implementing other communication mechanisms, such as RS-232 or Ethernet.
The choice of modifying a program like AVaRICE notably sped up the development and allowed the reuse of existing user interfaces with which our team was already familiar.
The use of the tools proposed by the FPGALibre project proved adequate for this development.
Fig. 3. Debugging session.
7. REFERENCES
[1] S. E. Tropea and J. P. D. Borgna, "Microcontrolador compatible con PIC16C84, bus WISHBONE y video," in FPGA Based Systems. Mar del Plata: Surlabs Project, II SPL, 2006, pp. 117-122.
[2] S. E. Tropea, "Interfaz de depuración para microcontrolador," in 2008 4th Southern Conference on Programmable Logic Designer Forum Proceedings, Bariloche, 2008, pp. 105-108.
[3] R. M. Cibils, A. Busto, J. L. Gonella, R. Martinez, A. J. Chielens, J. M. Otero, M. Núñez, and S. E. Tropea, "Wide range neutron flux measuring channel for aerospace application," in Space Technology and Applications International Forum - STAIF 2008 Proceedings, vol. 969, New Mexico, 2008, pp. 316-325.
[4] S. E. Tropea, D. J. Brengi, and J. P. D. Borgna, "FPGAlibre: Herramientas de software libre para diseño con FPGAs," in FPGA Based Systems. Mar del Plata: Surlabs Project, II SPL, 2006, pp. 173-180.
[5] INTI Electrónica e Informática et al., "Proyecto FPGA Libre," http://fpgalibre.sourceforge.net/.
[6] Debian, "Sistema operativo Debian GNU/Linux," http://www.debian.org.
[7] GNU project, http://www.gnu.org/.
[8] R. Lepetenok. (2009, Nov.) AVR Core. OpenCores.org. [Online]. Available: http://www.opencores.org/project,avr_core
[9] Silicore and OpenCores.Org, "WISHBONE System-on-Chip (SoC) interconnection architecture for portable IP cores," http://prdownloads.sf.net/fpgalibre/wbspec_b3-2.pdf?download.
[10] (2009, Nov.) AVR assembler (avra). [Online]. Available: http://avra.sourceforge.net/
[11] (2009, Nov.) GCC, the GNU compiler collection. [Online]. Available: http://gcc.gnu.org/
[12] M. Michalkiewicz, J. Wunsch et al. (2009, Nov.) AVR C runtime library. [Online]. Available: http://www.nongnu.org/avr-libc/
[13] (2009, Nov.) GDB: The GNU project debugger. [Online]. Available: http://www.gnu.org/software/gdb/
[14] T. A. Roth et al. (2009, Nov.) Simulavr: an AVR simulator. [Online]. Available: http://savannah.nongnu.org/projects/simulavr/
[15] A. Trapanotto, D. J. Brengi, and S. E. Tropea, "Puente IEEE 1284 en modo EPP a bus WISHBONE," in FPGA Based Systems. Mar del Plata: Surlabs Project, II SPL, 2006, pp. 257-264.
[16] R. A. Melo and S. E. Tropea, "IP core puente USB a WISHBONE," in XV Workshop Iberchip, vol. 2, Buenos Aires, 2009, pp. 531-533.
[17] S. E. Tropea and R. A. Melo, "USB framework - IP core and related software," in XV Workshop Iberchip, vol. 1, Buenos Aires, 2009, pp. 309-313.
[18] S. Finneran. (2009, Nov.) AVR in circuit emulator. [Online]. Available: http://avarice.sourceforge.net/
[19] Salvador E. Tropea et al., "SETEdit, un editor de texto amigable," http://setedit.sourceforge.net.
ACADEMIC EXPERIENCE ON INTRODUCING THE HDL-BASED DESIGN METHODOLOGY IN AN ELECTRONICS ENGINEERING PROGRAM

Roberto Martínez, Rosa Corti, Estela D'Agostino, Javier Belmonte, Enrique Giandomenico
Facultad de Ciencias Exactas, Ingeniería y Agrimensura
Universidad Nacional de Rosario (FCEIA/UNR)
Avenida Pellegrini 250, (2000) Rosario, Argentina
email: romamar, rcorti, estelad, belmonte, giandome@fceia.unr.edu.ar
ABSTRACT
This work describes the planning and partial implementation of the changes needed to introduce HDLs as the design methodology in the digital area of an Electronics Engineering program. To achieve a smooth integration of the topic with the conceptual contents of the courses, the advantages of HDLs and of the associated design environments are exploited as teaching tools. The impact of the modifications carried out in two courses during the current year is evaluated by comparing the 2009 examination results with those of previous years and through opinion surveys answered by the students. Finally, these measurements are used to draw conclusions and to define future lines of work.
1. INTRODUCTION
Electronics engineering, like other technology-based disciplines, advances at a fast pace, incorporating a wide range of novelties which, in turn, drive the inclusion of others. This process causes technologies that have become obsolete to be replaced by state-of-the-art ones, so the useful life cycle of some technological instruments can be very short. Faced with this situation, the teaching staff has a standing concern: selecting the most significant contents of the study program and, in some cases, those expected to remain valid the longest. The contents and their nature, in turn, determine the most appropriate teaching methodology. An ever-present aspiration of teachers (mainly in the final courses of the program), regarding the usefulness of the topics, is that they remain current at least during the first years of the student's future professional practice. The curriculum of a technological program such as Electronics Engineering must meet two goals: the student must learn the scientific foundations of electronics and, at the same time, become proficient in the latest emerging technologies. We also note that addressing this problem is hindered by the absence of studies on the epistemological aspect of our discipline; little has been said about the way in which this body of knowledge builds its categories and methods. It must be kept in mind that the structuring of the contents carries methodological implications in itself. In particular, the design of digital systems from the disciplinary perspective, that is, as a body of knowledge structured for teaching, must be accompanied by an adequate pedagogical framework that facilitates it. The constructivist theory of the teaching-learning process is based on the idea that the reality we believe we know is actively constructed by the knowing subject, the student. In constructivism, coming to know is an adaptive process that organizes the subject's experiential world. Many authors have contributed to this line of thought; among them, Ausubel [1] introduces two key concepts, meaningful learning and subsumers. Meaningful learning is that in which the ideas expressed symbolically are related in a substantial way to what the student already knows, to prior knowledge, that is, referenced to some essential aspect of the student's cognitive structure. For Ausubel, the most relevant form of meaningful learning occurs when the new ideas are related, in a subordinate way, to relevant ideas of a higher level of abstraction, generality and inclusiveness. He calls these pieces of prior knowledge, which serve as anchors for the new concepts, subsumers. Also within this model, Bruner [2], from the curricular perspective, proposes that the curriculum should be organized as a spiral, that is, working periodically on the same contents, each time in greater depth. This helps the student to modify, in a continuous process, the mental representations he or she has been building. These considerations were taken into account when planning the changes described here. The rest of the work is organized as follows: Section 2 describes the current situation of the digital area, Section 3 presents the proposed modifications, whose implementation is analyzed in Section 4. Finally, Section 5 lists the conclusions reached and outlines future lines of work.
2. CURRENT SITUATION OF THE DIGITAL AREA
The rapid evolution of digital electronics, together with the associated tools and technologies, has revolutionized the way digital systems are analyzed, designed and synthesized. Hardware description languages (HDLs), together with electronic design automation (EDA) environments, incorporate the Top-Down design methodology and, as the authors state in [3], are the driving forces behind the development of microelectronics. In this context we agree with what is stated in [4], namely that teaching design-oriented HDLs provides a high-quality tool for learning digital systems. Likewise, the availability of FPGA development boards, provided to academia by the main vendors of this technology (Xilinx, Altera) [5], [6] through their university programs, has made it possible to adopt a Project Based Learning (PBL) approach, whose application and potential in the teaching of our discipline have been reported in several works [7], [8], [9].
In the Electronics Engineering program of the FCEIA/UNR, the digital area comprises three compulsory courses, Digital I, II and III, and several elective courses whose syllabi can be consulted in [10]. The HDL-based design methodology is covered only in an elective course which, due to its nature, is taken by only a fraction of the students. However, considering the importance of the topic, it was deemed necessary to incorporate it into the compulsory courses so that it becomes part of the comprehensive education of future engineers.
3. MODIFICATIONS
The goal of the modifications presented here was to incorporate the HDL-based design methodology into the compulsory courses of the digital area of the Electronics Engineering program. To achieve this, we built on the changes, agreed upon with the teaching staff, that have been introduced in the program since 2001 and that are detailed in [11].
The task was planned so as to integrate the topic of interest with the concepts covered in each of the courses, in order to exploit the potential of HDLs as teaching tools [12]. The planning took into account the curricular contents of each course and the proper sequencing for teaching the new knowledge. The topics were distributed as follows:
Digital I: An introduction to VHDL is added, covering design units, the description of combinational systems, basic concepts of sequential systems, and the description of a Petri net using the direct translation method described in [13] (an illustrative sketch of this kind of translation is given after this list).
Digital II: The VHDL description of combinational and sequential systems started in Digital I is deepened, incorporating other standard RTL-level blocks and analyzing their customization. Using the Top-Down methodology, schematic and VHDL descriptions are integrated into a system of medium complexity.
Digital III: Emphasis is placed on the different description styles for Finite State Machines (FSM) and their application in the Control-Data Path model. In addition, work is planned on the structural description style, the handling of libraries and the reuse of modules, relating them to processor architecture and peripheral interconnection. These modifications will be implemented in the first semester of 2010.
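For reference, one common direct-translation scheme assigns one flip-flop per place and fires a transition when its input place is marked and its condition holds; the tiny net below, with hypothetical places P0/P1 and inputs start/done, is only a sketch of that idea and is not taken verbatim from the method of [13].

  library ieee;
  use ieee.std_logic_1164.all;

  -- Illustrative direct translation of a two-place Petri net:
  -- one flip-flop per place; a transition moves the marking when
  -- its input place holds a token and its condition is true.
  -- Hypothetical example, for teaching purposes only.
  entity petri_demo is
     port(
        clk_i   : in  std_logic;
        rst_i   : in  std_logic;
        start_i : in  std_logic;  -- condition of transition T0 (P0 -> P1)
        done_i  : in  std_logic;  -- condition of transition T1 (P1 -> P0)
        act_o   : out std_logic); -- action associated with place P1
  end entity petri_demo;

  architecture rtl of petri_demo is
     signal p0, p1 : std_logic;
  begin
     process (clk_i)
     begin
        if rising_edge(clk_i) then
           if rst_i='1' then
              p0 <= '1'; -- initial marking: a token in P0
              p1 <= '0';
           else
              -- T0 fires: the token leaves P0 and enters P1
              -- T1 fires: the reverse move
              p0 <= (p0 and not start_i) or (p1 and done_i);
              p1 <= (p1 and not done_i) or (p0 and start_i);
           end if;
        end if;
     end process;
     act_o <= p1;
  end architecture rtl;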
4. IMPLEMENTATION OF THE CHANGES
The modifications were implemented during 2009. In the first semester the work was done in Digital I and, considering the results achieved, in the second semester we moved on to the modifications in Digital II.
4.1. Digital I
Digital I is the first course dealing with digital systems. The program is structured so that combinational systems, with their different description and synthesis styles, are studied first. Afterwards, the design of sequential systems is addressed: the general theoretical framework of their modeling (Mealy and Moore representations) is given, and Petri nets are used to model sequential systems of low and medium complexity. Problem solving is oriented towards sequential systems of industrial characteristics, featuring a large number of inputs, strongly unspecified behavior, and parallel evolutions or shared resources, systems for which Petri nets prove very efficient as a modeling tool. Finally, hardware implementation is addressed through wired synthesis (gates, flip-flops and PROM) and programmed synthesis (PLC). The students carry out laboratory work on PLCs, where they implement a low-complexity digital system modeled with a Petri net.
VHDL was introduced as one more representation of the behavior of the elementary combinational modules. That is, alongside the truth table and the Boolean expression that model a module, a VHDL statement in dataflow style was added. Of course, at this point of the course it was not possible to go deeper into language topics such as data types or description styles. When the time came to deal with combinational circuits, where several gates must be interconnected to form the complete circuit, the concepts of entity and architecture were introduced. Later, the concept of the sequential description of a concurrent element was introduced through the process statement. The modeling of a flip-flop and of a counter turned out to be adequate to exemplify algorithmic description in VHDL. In the last phase, once the topic of Petri nets had been developed, the VHDL description of the Petri net, using the direct translation method proposed in [13], was added to the traditional implementations (wired and PLC). Throughout the development of the new topic, the underlying idea was to teach this hardware description language with a design orientation [4]. The use of the simulation tool within the development environment proved very useful for the teaching-learning process. Since this was the first experience of early incorporation of VHDL, the laboratory classes on the topic were given only to a group of fourteen students (the pilot group) out of a total of sixty-two enrolled in the course. The remaining students (48) worked in the laboratories in the usual way. The pilot group carried out a laboratory assignment consisting of modeling, by means of a Petri net, a low-complexity sequential system. The resulting model was described in VHDL and its behavior was verified through simulation in the ISE WebPack environment [5].
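As a picture of that progression, the following hedged sketch (hypothetical entity names, not the course material itself) shows first a combinational module described with a single dataflow statement, and then a counter modeled algorithmically inside a process.

  library ieee;
  use ieee.std_logic_1164.all;

  -- 1) A combinational module described with one dataflow statement,
  --    the VHDL counterpart of its truth table / Boolean expression.
  entity majority3 is
     port(a, b, c : in std_logic; y : out std_logic);
  end entity majority3;

  architecture dataflow of majority3 is
  begin
     y <= (a and b) or (a and c) or (b and c);
  end architecture dataflow;

  -- 2) A sequential element (a 4-bit counter) described algorithmically
  --    inside a process, the style used to model flip-flops and counters.
  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity count4 is
     port(
        clk_i : in  std_logic;
        rst_i : in  std_logic;
        q_o   : out unsigned(3 downto 0));
  end entity count4;

  architecture behavioral of count4 is
     signal q : unsigned(3 downto 0):=(others => '0');
  begin
     process (clk_i)
     begin
        if rising_edge(clk_i) then
           if rst_i='1' then
              q <= (others => '0');
           else
              q <= q+1;
           end if;
        end if;
     end process;
     q_o <= q;
  end architecture behavioral;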
In the stage of assessing the knowledge acquired on design-oriented VHDL, an evaluation test was given to the whole course population. Table 1 summarizes the results.

Table 1. Results of the evaluation of the VHDL topic in Digital I

                              Number of    Score (0 to 100 points)
                              students     80 or more   79 to 60   below 60
  Did not do the VHDL lab     48           19           6          23
  Did the VHDL lab            14           11           2          1
  Total                       62           30           8          24
A considerable difference is observed between the pilot group and the rest of the students. Indeed, in the pilot group 79% obtained 80 or more points out of 100, whereas in the other group this score was obtained by 40%. Likewise, in the pilot group only 7% failed to pass, while in the other group 47% did not reach a satisfactory score. We can summarize that working in the development environment, with simulation and behavior-verification activities, had a highly beneficial influence on learning. Finally, we cannot overlook that the student-student and student-teacher interaction established in the laboratory work groups also had a favorable effect on these results.
4.2. Digital II
Digital II belongs to the upper cycle of the Electronics Engineering program and is the second of the three compulsory courses of the digital area. It is characterized as an applied-technology course, since its theoretical contents aim directly at solving practical problems of the area using concrete resources and devices. The course covers two blocks of knowledge which, although they have strong points of contact, show distinct characteristics: constructive design of systems based on the use of functional blocks/programmable logic, and basic microprocessor architecture/assembly programming.
The course introduces the design of digital systems with a Top-Down approach, by which the problem at hand is divided into simpler modules, with the goal of describing them as an interconnection of functional blocks at the RTL level. The ISE WebPack environment [5] is used, with the schematic entry flow, incorporating the necessary blocks as library elements or as customized modules when the required functionality is not available. In this framework, using the basic concepts of design with VHDL introduced the previous semester in Digital I, the modules obtained from the Top-Down partition of the system were implemented in VHDL. The methodology was based on solving problems from a set of requirements: the teacher proposes and analyzes different solution options, followed by the students at their workstations with the design environment installed, studying the impact of the different descriptions on the characteristics and behavior of the resulting circuits. In this sense, the work revolved around three closely related axes:
- the VHDL description style of the devices,
- the associated RTL-level schematic,
- behavioral simulation.
Working on these three aspects within the same design allowed the students to place themselves in the context of describing hardware elements and their connections, since they could visualize the system as an interconnection of already known RTL blocks, verify the behavioral changes through simulation, and relate them to the way the circuit had been described in VHDL.
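The behavioral-simulation axis can be pictured with a minimal testbench such as the hedged sketch below, which drives the hypothetical count4 module sketched earlier so its waveform can be inspected in the simulator (illustrative only, not the course material).

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  -- Minimal testbench sketch (hypothetical): generates clock and reset
  -- for the count4 example so its behavior can be verified by simulation.
  entity count4_tb is
  end entity count4_tb;

  architecture sim of count4_tb is
     signal clk : std_logic:='0';
     signal rst : std_logic:='1';
     signal q   : unsigned(3 downto 0);
  begin
     uut : entity work.count4
        port map(clk_i => clk, rst_i => rst, q_o => q);

     clk <= not clk after 10 ns;      -- 50 MHz clock
     rst <= '0' after 35 ns;          -- release reset after a few cycles

     process
     begin
        wait for 500 ns;              -- let the counter wrap at least once
        assert false report "end of simulation" severity failure;
        wait;
     end process;
  end architecture sim;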
The problems tackled integrated schematic-based and VHDL-based design in their solution. This approach made it possible to compare the two ways of describing a circuit, highlighting the benefits of each. The VHDL descriptions showed great flexibility and offered the possibility of parameterizing the code; these characteristics eased the customization of the functionality of the modules.
On the other hand, the students verified that this type of description simplified the changes made during the debugging process. It became clear that these characteristics are fundamental as the complexity of the designs grows. In turn, if the complexity of the system is not very high, schematics present the functionality of the circuit very clearly by showing the graphical connection of modules, and they were an important support during the development of the topic. From the methodological point of view, knowing the blocks available in the library simplified the analysis of the different descriptions and helped to understand the impact that changes in the VHDL description have on the characteristics and behavior of the circuit.

Fig. 1. Conduct of the classes: students' rating (high, medium, low, very low, no opinion) of the clarity of the classes, the integration of theory and practice, and the quality of the teaching material.

Table 2. Importance of and interest in the topic

                          High     Medium   Low     Very low   No opinion
  Importance of topic     32.6%    37.2%    7.0%    4.7%       18.6%
  Interest raised         41.9%    44.2%    11.6%   2.3%       0.0%

Table 3. Evaluations

                                        High     Medium   Low     Very low   No opinion
  Difficulty of the evaluations         27.9%    69.8%    0.0%    0.0%       2.3%
  Consistency between the level of
  the classes and the evaluations       34.9%    51.2%    9.3%    2.3%       2.3%
The evaluation of the new contents was carried out through an individual mid-term exam and a practical group assignment done in teams of two students. The mid-term exam consisted of the VHDL description of a simple module, whose interface and functionality requirements were set considering the duration of the test (1 hour). The practical assignment was of a higher level of complexity, since the students worked in groups over a period of two weeks, during which they had the support of the teachers to solve the proposed problem. They had to address the design of a digital system with a hierarchical approach, in which several of the constituent modules had to be developed in VHDL. It is worth noting that the percentage of students with scores above 6 (Pass) in the first thematic block of the course increased from an average of 45% (2006 to 2008) to 55% in 2009.
To evaluate the impact that the introduced changes had on the students, an anonymous and voluntary survey was carried out once the evaluations of the first module of the course were finished; it was answered by forty-three students. It included questions on different aspects of the work done and on the interest that the new knowledge had raised, quantifying each aspect on the scale: high, medium, low, very low, no opinion.
Table 2 shows that 86% of the respondents declared a high or medium interest in the new topic, and 70% rated equally the importance that, in their opinion, the new knowledge has for their professional education.
Fig. 1 shows that most students declared themselves satisfied with the clarity of the classes given, the integration of theoretical and practical aspects, and the quality of the teaching material provided by the course staff.
The values in Table 3 show that most respondents consider the consistency between the depth of the treatment of the topics and the difficulty of the evaluations to be reasonable.
In a free-response field for suggestions, the students expressed their interest in completing the design flow by implementing the circuits on development boards. This would be very important since, as has been verified in the elective courses of the area, being able to implement the design results is very helpful to consolidate knowledge and to stimulate the students' interest. It is planned to incorporate this kind of activity in future editions of the course, based on an update of the laboratories that is currently under way.
5. CONCLUSIONS AND FUTURE WORK
The results reported in this work are still partial, since the implementation stage is still in progress; nevertheless, we consider them satisfactory. It can be said that the usefulness of HDLs and EDA environments as teaching tools for digital design has been verified, and that a powerful instrument for the modeling, simulation and implementation of digital systems has been incorporated into the compulsory courses. The methodology provided flexibility to make changes and allowed the designs to be parameterized. However, despite the emphasis placed on relating the HDL descriptions to the associated hardware, the students still tend to use them as traditional high-level languages. They tend to keep thinking of the description as purely sequential, forgetting that the object now being described is of a completely different nature, and they strive to obtain compact code rather than a good description that meets the requirements.
Finally, a new elective course, "SoC-oriented digital design", is planned for the area. Its central topics will be: IP cores (software and hardware), 8- and 16-bit embedded processors (architecture and programming), integrated work environments, and signal processing, control and RF communication applications.
6. REFERENCES
[1] Ausubel D. P., Novak J. D., Hanesian H., Sicología Educativa: un punto de vista cognoscitivo, Editorial Trillas, México, 1986 (orig. ed. 1978).
[2] Bruner J. S., Desarrollo cognitivo y educación, Editorial Morata, Madrid, 1988.
[3] M. Castro, S. Acha, J. Perez, A. Hilario, J. V. Miguez, F. Mur, F. Yeves, J. Peire, "Digital systems and electronics curricula proposal and tool integration," in Proc. 30th ASEE/IEEE Annual Frontiers in Education, vol. 2, 2000, pp. F2E/1-F2E/6.
[4] V. A. Pedroni, "Teaching design-oriented VHDL," in Proc. of the 2003 IEEE International Conference on Microelectronic Systems Education (MSE'03), 2003, pp. 6-7.
[5] Xilinx Inc., www.xilinx.com.
[6] Altera, www.altera.com.
[7] J. Macías-Guarasa et al., "A project-based learning approach to design electronic system curricula," IEEE Trans. Education, 49(3), 2006.
[8] J. Northern, "Project-Based learning for a digital circuits design sequence," in Proc. IEEE Region 5 Technical Conf., 2007, USA.
[9] F. Machado, S. Borromeo, N. Malpica, "Project based learning experience in VHDL digital electronic circuit design," in Microelectronic Systems Education, MSE '09, IEEE International Conference on, 2009, pp. 49-52.
[10] www.dsi.fceia.unr.edu.ar
[11] R. Corti, R. Martínez, E. D'Agostino, E. Giandomenico, "Experiencia didáctica en una carrera de Ingeniería Electrónica. Actualización de los contenidos del área digital," Revista de Enseñanza de la Ingeniería, año 7, no. 13, pp. 61-72, Dic. 2006.
[12] G. Baliga, J. Robinson, L. Weiss, "Revitalizing CS hardware curricula: object oriented hardware design," The Journal of Computing Sciences in Colleges, 25(3): 60-66, 2010.
[13] R. Martínez, J. Belmonte, R. Corti, E. D'Agostino, E. Giandomenico, "Descripción en VHDL de un sistema digital a partir de su modelización por medio de una Red de Petri," in Proc. FPGA Designer Forum, 2009 SPL, pp. 7-11.


AUDIO OVER ETHERNET: IMPLEMENTATION USING AN FPGA

José Mosquera, Andrés Stoliar, Sol Pedre, Maximiliano Sacco and Patricia Borensztejn
Departamento de Computación
Facultad de Ciencias Exactas y Naturales
Universidad de Buenos Aires
email: josemosquera@oracle.com, {astoliar, spedre, msacco, patricia}@dc.uba.ar
ABSTRACT
This article presents an application implemented on an FPGA and developed with the finite state machine with datapath (FSMD) method. The application transfers audio signals coming from the UCB audio codec embedded on the ADS audio and video board to a board containing a Xilinx Virtex FX FPGA, which performs the protocol conversion needed to packetize the audio and send it to the network through an Ethernet interface present on the board.

1. INTRODUCTION
The growing complexity of digital system design makes it necessary to use models that describe the behavior of the system in such a way that they not only capture all of its functionality but can also be easily translated into hardware or into a hardware description language. It is also desirable that these models help in the final stage of functional verification of the system. Finite State Machines with a Data Path constitute a methodology widely used in digital system design. Based on this methodology, very well described in the book by Pong P. Chu [], we have implemented a digital system on a development board that obtains audio from a microphone and packetizes it to be sent through an Ethernet interface. The application is an example of the use of FPGAs to implement interfaces between different input/output devices. The goal of the development is to take this application as a case study to obtain a complete specification using the FSMD (FSM with Data Path) methodology that allows automating the generation of HDL (Hardware Description Language) code. One feature that makes the chosen application interesting is the difference between the frequency at which the audio samples are acquired and the frequency at which they are sent over the network. This difference of several orders of magnitude was resolved by means of a specific frequency-adaptation module.
This article is organized as follows: Section 2 describes the project to be carried out and its interfaces, Section 3 presents the design of the system, and Section 4 the implementation using the chosen design methodology. Finally, Section 5 presents the conclusions.

2. PROJECT DESCRIPTION
The project consists of using an FPGA to implement logic that transfers audio signals coming from a microphone to the network through an Ethernet interface.
The nature of audio signal digitization implies sampling at a constant frequency on the order of tens of kilohertz, while the use of an IEEE 802.3 Ethernet interface [] implies sending packets at rates of tens of megahertz, where the time between packets is not necessarily constant.
This difference in the operating mode of the two interfaces is what motivated us to tackle the implementation of their integration in an FPGA. Fig. 1 illustrates the architecture of the project.
For the implementation of the project the Virtex FX kit from Avnet Electronics was used, which contains a Xilinx Virtex FPGA []. The same kit also provides a Base-T Ethernet port and general purpose AvBus connectors [], through which an Audio and Video board [] is connected.
The input and output components of the system are described below.
2.1. Digitizing the microphone signal
The audio digitization is based on the UCB audio codec [] included on the Audio and Video board. A codec is a device that provides analog and digital input and output interfaces, performing the analog-to-digital and digital-to-analog conversions simultaneously.

Table 1. Subset of the UCB codec signals

  Signal          Description
  AC97_BIT_CLK    12.288 MHz serial data clock
  AC97_SYNC       Signal driven by the AC'97 controller that marks the start of a frame
  AC97_RESET      Reset of the UCB codec
  AC97_SDATA_IN   UCB codec data output stream

Table 2. Subset of the EMAC signals

  Signal       Description
  TXDVLD       Request to send a packet to the EMAC
  TXD          Byte being transmitted
  TXACK        Acknowledgement from the EMAC to start the transmission
  TXUNDERRUN   Tells the EMAC to discard the packet currently being transmitted
  CLK          Clock coming from the physical layer chip
Fig. 1. Architecture of the project.

Fig. 2. Datapath of the solution.
The UCB codec is a stereo audio codec whose digital interface complies with the AC'97 specification [], which defines AC'97 as a bidirectional digital serial interface in charge of connecting an AC'97 audio codec with an AC'97 controller.
Table 1 describes a subset of the UCB codec signals that are of interest for our application.
The serial interface between the audio codec and the controller is structured in frames of slots multiplexed in time on the AC97_SDATA_IN signal. Each frame has 256 bits, where only slot 0 is 16 bits long; the remaining slots are 20 bits long.
Slot 0 carries information indicating the availability of the digitized samples in the other slots, as well as the availability of the codec in general. The bits corresponding to the microphone signal sample are carried in one of the input slots.
The bit clock of the AC'97 interface is 12.288 Mbit/s, determined by the UCB codec, which uses a 48 kHz sampling frequency.
2.2. Ethernet interface
The IEEE 802.3 Ethernet specification is a computer networking standard; it defines the cabling and physical-layer signaling characteristics and the data frame formats at the data link layer of the OSI model [].
As shown in the project architecture (Fig. 1), the implementation of the IEEE 802.3 interface is based on the DP PHY chip [], responsible for the physical layer, and the EMAC module [] (an IP core, i.e. a Semiconductor Intellectual Property core, embedded in the Xilinx Virtex FPGA), in charge of the medium access sublayer.
Table 2 describes a subset of the EMAC signals that are of interest for the application.
The operating frequency of the EMAC is determined by the DP PHY chip, according to the selected transfer rate. This is the frequency of the CLK signal (see Table 2) used to transfer the Ethernet packets in the EMAC IP core.
3.DISEfO
3DUD HO GLVHxR GH OD VROXFLyQ VH XWLOL]y OD WpFQLFD GH
)60' )LQLWH 6WDWH 0DFKLQH ZLWK 'DWDpDWK >@ /DV

6HPLFRQGXFWRU,QWHOOHFWXDO3URpHUW\FRUH
14

)LJ)60'$&FRQWUROOHU

)LJ)60GHO$&FRQWUROOHU
)60' FRPbLQDQ XQD PiTXLQD GH HVWDGRV FRQWURO pDWK
FRQ RWUR FLUFXLWR VHFXHQFLDO 'DWD 3DWK 'LFKD WpFQLFD
FRQVLVWHHQGLVHxDUHQXQDpULPHUDIDVHHOFDPLQRGHGDWRV
HQbDVHDODVRpHUDFLRQHVUHTXHULGDV\OXHJRODPiTXLQDGH
HVWDGRVTXHFRQWURODGLFKRFDPLQR
(Q QXHVWUR pUR\HFWR ORV GDWRV pURYHQLHQWHV GHO FRGHF
OOHJDQHQIRUPDVHULDODWUDYpVGHODVHxDOAC97BSDATABIN
YHU 7DbOD \ GHbHQ HQWUHJDUVH b\WH D b\WH DO (0$& D
WUDYpV GH OD VHxDO TXD YHU 7DbOD (VWR LPpOLFD XQD
FRQYHUVLyQ VHULHpDUDOHOR GH D bLWV pDUD OR FXDO VH
GHFLGLy XWLOL]DU XQ VKLIW UHJLVWHU UHJLVWUR GHVpOD]DPLHQWR
GH XQb\WH &RPRODVPXHVWUDVGLJLWDOL]DGDVVRQGH bLWV
VHGHFLGLyGLYLGLUODVHQWUHVb\WHVFRQVHFXWLYRVGRQGHORV
bLWV PHQRV VLJQLILFDWLYRV GHO WHUFHU b\WH QR pRVHHQ
LQIRUPDFLyQ~WLO
/DV PXHVWUDV GLJLWDOL]DGDV GH PLFUyIRQR GHbHQ
DOPDFHQDUVH WHPpRUDOPHQWH KDVWD FRQIRUPDU XQ pDTXHWH
(WKHUQHW 6H GHFLGLy DORjDU ORV b\WHV pURYHQLHQWHV GHO VKLIW
UHJLVWHUHQXQDPHPRULD),)2ILUVWLQILUVWRXW
/D ),)2 pHUPLWH TXH VH DOPDFHQHQ QXHYDV PXHVWUDV DO
PLVPR WLHPpR TXH VH WUDQVPLWHQ ORV pDTXHWHV GLVpRQLbOHV
(O DQFKR HQ bLWV GH OD ),)2 HV ILjDGR pRU ORV bLWV GH OD
VHxDO 7;' GHO (0$& PLHQWUDV TXH OD pURIXQGLGDG R
FDQWLGDG GH b\WHV D DOPDFHQDU GHbH VHU WDO TXH pHUPLWD
JXDUGDU DO PHQRV XQ pDTXHWH FRPpOHWR \ ODV PXHVWUDV
DGLFLRQDOHV PLHQWUDV HO PLVPR VH WUDQVPLWH 6H GHFLGLy
XWLOL]DU pDTXHWHV GH WDPDxR ILjR (Q OD VHFFLyQ VH
GHWHUPLQD HO WDPDxR GH OD ),)2 WRPDQGR HQ FXHQWD ODV
YHORFLGDGHV GH ODV LQWHUIDFHV \ OD FDQWLGDG GH pDTXHWHV D
DOPDFHQDUORFDOPHQWH
/RV pDTXHWHV (WKHUQHW GHbHQ LQFOXLU HQFDbH]DGRV GH
ODVFDpDVGHHQODFHUHG\WUDQVpRUWH$WDOILQVHLQFOX\HHQ
HO GDWDpDWK XQ PXOWLpOH[RU TXH pHUPLWH LQVHUWDU VRbUH OD
VHxDOTXDORVb\WHVGHOHQFDbH]DGR\OXHJRORVb\WHVGHOD
),)2 6H GHFLGLy LPpOHPHQWDU HQFDbH]DGRV ILjRV
DOPDFHQDQGR ORV PLVPRV HQ XQD PHPRULD (Q OD )LJ VH
LOXVWUDHOGDWDpDWKGHODVROXFLyQ

'DGR TXH ODV VHxDOHV AC97BSDATABIN \ TXD


pHUWHQHFHQ D GRPLQLRV GH UHORj GLVWLQWRV GHFLGLPRV
LPpOHPHQWDU XQD PiTXLQD GH HVWDGRV GLIHUHQWH pDUD FDGD
XQR GH ORV GRPLQLRV GH UHORj &RPR FRQVHFXHQFLD GH HVWD
GHFLVLyQTXHGDURQGHWHUPLQDGRVGRVPyGXORVHOPyGXORGH
PXHVWUHR FRQ HO VKLIW UHJLVWHU \ HO PyGXOR LQWHUID]
(WKHUQHWFRQHOPXOWLpOH[RU\ODPHPRULDGHHQFDbH]DGRV
TXHGDQGR pRU UHVROYHUOD DGDpWDFLyQHQWUHDPbRVGRPLQLRV
GHUHORj
(Q >@ VH pURpRQHQ WUHV WLpRV GH VROXFLRQHV pDUD
UHDOL]DUODLQWHJUDFLyQGHPyGXORVGHGLVWLQWRVGRPLQLRVGH
UHORj 8QD GH HVWDV VROXFLRQHV HV OD XWLOL]DFLyQ GH XQD
PHPRULD ),)2 DVLQFUyQLFD GRQGH ORV GDWRV HQWUDQWHV VRQ
HQFRODGRVDLQWHUYDORVGHWLHPpRDUbLWUDULR \H[WUDtGRVpRU
HOODGRUHFHpWRUHQODPHGLGDTXHpXHGDpURFHVDUORV
)LJ0yGXORV\FRPpRQHQWHVGHOpUR\HFWR
15

)LJ)60')UDPHU

)LJ)60)UDPHU
)LJ0yGXORGHDGDpWDFLyQGHIUHFXHQFLD
'HODQWHULRUDQiOLVLVVHFRQFOX\yHQXQDVROXFLyQGHWUHV
PyGXORV
0yGXOR GH PXHVWUHR FRPpXHVWR pRU HO FKLp
8&% \ OD OyJLFD GH FRQWURO D OD TXH
GHQRPLQDPRV$&FRQWUROOHU
0yGXOR GH DGDpWDFLyQ GH IUHFXHQFLDV \
HPpDTXHWDGRFRPpXHVWRpRUOD),)2DVLQFUyQLFD
0yGXOR LQWHUID] (WKHUQHW ,((( 0bV
FRPpXHVWR pRU HO FKLp '3 3+Y HO ,3 FRUH
(0$& \ OyJLFD GH FRQWURO D OD TXH GHQRPLQDPRV
)UDPHU
(Q OD )LJ VH LOXVWUD OD DUTXLWHFWXUD GH OD VROXFLyQ
GHWDOODQGR HQ FDGD PyGXOR VXV FRPpRQHQWHV LQWHUQRV (Q
FRORUJULVVHVHxDODQORVFRPpRQHQWHVLPpOHPHQWDGRVVRbUH
OD)3*$;LOLQ[9LUWH[
4. IMPLEMENTACIN
(QHVWDVHFFLyQVHGHVFULbHODLPpOHPHQWDFLyQGHFDGDXQR
GHORVPyGXORVHQOD)3*$
4.1.Mdulodemuestreo:AC`97controller
&RPR H[pOLFDPRV HQ OD VHFFLyQ GH GLVHxR HO GDWDpDWK GH
HVWHPyGXORHVWiFRPpXHVWRpRUHOVKLIWUHJLVWHUGHbLWV
/D)LJPXHVWUDOD)60'FRPpOHWD\OD)LJGHWDOOD
OD PiTXLQD GH HVWDGRV $ FRQWLQXDFLyQ YDPRV D GHVFULbLU
UHVXPLGDPHQWHVXIXQFLRQDPLHQWR
S1HVHOHVWDGRLQLFLDOGHHVpHUDGHFRPLHQ]RGHIUDPH
'XUDQWHORVHVWDGRV S2DS5OD)60HVWipURFHVDQGRHO
VORW UHFRUGDPRVTXHGLFKR VORW FRQWLHQHODVLQGLFDFLRQHV
GH WUDPD YiOLGD \ VORW YiOLGR 'XUDQWH HVRV HVWDGRV HVWi
DFWLYDODVHxDOAC97BSYNC
S6 \ S7 VRQ ORV HVWDGRV GRQGH VH UHFLbH OD PXHVWUD GH
DXGLR 'XUDQWH HVWRV HVWDGRV HVWi DFWLYD OD VHxDO GH FDUJD
SRBLOADGHOVKLIWUHJLVWHU(Q6VHDFWLYDODVHxDO/2$'
GHOD),)2pDUDHQFRODUHOb\WHGHOVKLIWUHJLVWHU
S8HVHOHVWDGRGHUHVWDUWGHOD),)2FRPRUHVpXHVWDDOD
VHxDOFULLpURYHQLHQWHGHOD),)2
4.2.MdulointerfazEthernet:Framer
&RPR H[pOLFDPRV HQ OD VHFFLyQ GH GLVHxR HO GDWDpDWK GH
HVWH PyGXOR HVWi FRPpXHVWR pRU XQD PHPRULD GH
HQFDbH]DGRV\XQPXOWLpOH[RUTXHVHOHFFLRQDHQWUHORVGDWRV
TXHpURYLHQHQGHHVWD\ORVGHOD),)2
/D)LJPXHVWUDOD)60'FRPpOHWD\OD)LJGHWDOOD
OD PiTXLQD GH HVWDGRV $ FRQWLQXDFLyQ YDPRV D GHVFULbLU
UHVXPLGDPHQWHVXIXQFLRQDPLHQWR
S1HVHOHVWDGRLQLFLDOGRQGHVHHVpHUDODVHxDO5($'Y
TXHLQGLFDTXHOD),)2WLHQHXQpDTXHWHGLVpRQLbOH
S2 HV HO HVWDGR GRQGH VH VROLFLWD DFWLYDQGR OD VHxDO
TXDJLD \ VH HVpHUD FRQILUPDFLyQ GH OD (0$& VHxDO
7;$&.pDUDODWUDQVPLVLyQ
S3 HV HO HVWDGR GRQGH VH WUDQVPLWH HO HQFDbH]DGR GHO
pDTXHWHVHOHFFLRQiQGRVHDWUDYpVGHOPXOWLpOH[RU \HQ S4
VH WUDQVPLWH ORV b\WHV pURYHQLHQWHV GH OD ),)2
VHOHFFLRQiQGRVH D WUDYpV GHO PXOWLpOH[RU DFWLYDQGR OD
VHxDOREAD
S5HVHOHVWDGRGHHUURUVHxDOTXUNDRNDFWLYD
16

)LJ7LHPpRVGHOOHQDGR\YDFLDGRGHOD),)2
4.3.Mdulodeadaptacindefrecuenciasy
empaquetado.
7DQWR HO PyGXOR GH PXHVWUHR FRPR HO PyGXOR LQWHUID]
,((( pRVHHQ XQD IUHFXHQFLD GH RpHUDFLyQ pURpLD
GHWHUPLQDGD pRU VXV LQWHUIDFHV $& H ,(((
UHVpHFWLYDPHQWHORFXDOLPpOLFDODQHFHVLGDGGHLQWHJUDFLyQ
GHVHxDOHVHQWUHGRVGRPLQLRVGHUHORjGLVWLQWRV
(V UHVpRQVDbLOLGDG GHO PyGXOR GH DGDpWDFLyQ GH
IUHFXHQFLDV \ HPpDTXHWDGR UHDOL]DU OD WUDQVIHUHQFLD GH
LQIRUPDFLyQ HQWUH DPbRV GRPLQLRV GH UHORj 7DO FRPR
GHVFULbLPRV HQ OD VHFFLyQ VHOHFFLRQDPRV XQD ),)2
DVLQFUyQLFDpDUDWDOILQ
(QOD)LJVHLOXVWUDHOLQWHUFDPbLRGHVHxDOHVHQWUHORV
GRV PyGXORV GHVFULpWRV DQWHULRUPHQWH NyWHVH OD
pHUWHQHQFLDGHODVVHxDOHVDGLVWLQWRVGRPLQLRVGHUHORj
(OPyGXORGHDGDpWDFLyQGHIUHFXHQFLDV\HPpDTXHWDGR
VH bDVD HQ OD XWLOL]DFLyQ GH XQD ),)2 DVLQFUyQLFD
FRQILJXUDbOH HQ OD )3*$ ;LOLQ[ 9LUWH[ 'LFKD )3*$
pHUPLWH OD LPpOHPHQWDFLyQ GH ),)2V VLQFUyQLFDV \
DVLQFUyQLFDV D WUDYpV GH OyJLFD GHGLFDGD \ %5$0V
OLbHUDQGR UHFXUVRV pDUD VHU XWLOL]DGRV HQ HO UHVWR GH OD
OyJLFD/RV%ORFN 5$0 %5$0VRQ bORTXHV GH PHPRULD
5$0HPbHbLGRVHQHOLQWHULRUGHOD)3*$
3DUDFRQILJXUDUOD),)2DVLQFUyQLFDWXYLPRVTXHGHFLGLU
OD FDQWLGDG GH pDTXHWHV D DOPDFHQDU \ HO WDPDxR GH ORV
PLVPRV
5HVpHFWR GHO WDPDxR GHO pDTXHWH \D TXH QR WHQtDPRV
UHVWULFFLRQHV HQ FXDQWR D ORV UHWDUGRV GH WUDQVPLVLyQ HQWUH
HOORV GHFLGLPRV FRQVLGHUDU HO Pi[LPR WDPDxR TXH OD
HVpHFLILFDFLyQ ,((( GHILQH pXHV GH HVWD PDQHUD
WHQHPRVHOPHQRURYHUKHDGHQODWUDQVPLVLyQPHQRVb\WHV
GHHQFDbH]DGRV6HHOLJLyHOQ~PHURpRUVHUP~OWLpOR
GHFHUFDQRDOPi[LPRGHb\WHVpDUDODFDUJD~WLO
3RU RWUR ODGR YHULILFDPRV ORV WDPDxRV GLVpRQLbOHV pDUD
FRQILJXUDU OD PHPRULD ),)2 HQ OD )3*$ ;LOLQ[ 9LUWH[
VHOHFFLRQDQGR XQD ),)2 GH . [ bLWV TXH pHUPLWH
DOPDFHQDUPDVGHXQpDTXHWHORFXDODpRUWDXQDWROHUDQFLDD
GLVWLQWRVUHWDUGRVHQWUHpDTXHWHV
'DGDVHVWDVFRQVLGHUDFLRQHVHQOD)LJVHFDOFXODQORV
WLHPpRV GH OOHQDGR \ YDFLDGR FRQ ORV TXH WUDbDjD OD ),)2
/DVHxDODOPRVWIXOOGHOD),)2VHFRQILJXUypDUDDFWLYDUVH
DOVXpHUDUHOWDPDxRGHXQpDTXHWHXWLOL]iQGRVHFRPRVHxDO
GH5($'YpDUDHOPyGXOR(0$&
5.VERIFICACIN
3DUDFRPpURbDUHOFRUUHFWRIXQFLRQDPLHQWRGHFDGDXQRGH
ORV PyGXORV VH XWLOL]y HO PpWRGR GH OD VLPXODFLyQ /D
KHUUDPLHQWDXWLOL]DGDIXH0RGHO6LP>@
6H FRQVWUX\y XQ WHVWbHQFK pRU FDGD PyGXOR GLVHxDGR
$& &RQWUROOHU )UDPHU ),)2 GRQGH FDGD WHVWbHQFK
FRQVWDGH8878QLW8QGHU7HVW\'5,9(5HQFDUJDGRGH
OD JHQHUDFLyQ GH ORV HVWtPXORV (O PRQLWRUHR pDUD
FRPpURbDU HO FRUUHFWR IXQFLRQDPLHQWR VH KL]R DQDOL]DQGR
YLVXDOPHQWHHOGLDJUDPDGHWLHPpRVJHQHUDGR
5.1.TestbenchAC97controller.
3DUDJHQHUDUORVHVWtPXORVQRVbDVDPRVHQODHVpHFLILFDFLyQ
GHO pURWRFROR $& >@ \ QR HQ OD LPpOHPHQWDFLyQ GHO
$&FRQWUROOHU
/D )LJ PXHVWUDODVLPXODFLyQ pDUD HOFDVRGH WUDPD
YiOLGD \ WLPH VORW YiOLGR 2bVHUYDU TXH OD VHxDO loaa,
XWLOL]DGDFRPRZULWHHQDbOHHQOD),)2VHDFWLYDWUHVYHFHV
2bVHUYDU WDPbLpQ TXH OD VHxDO aataBout FRQWLHQH ORV WUHV
b\WHVFRUUHVpRQGLHQWHVDODPXHVWUDGHPLFUyIRQR
2WURV FDVRV GH pUXHbD UHDOL]DGRV IXHURQ WUDPD QR
YiOLGDWLPHVORWQRYiOLGR\),)2IXOO

)LJ'LDJUDPDGHWLHPpRVGHOWHVWbHDQFKVHFFLyQ
17

)LJ'LDJUDPDGHWLHPpRVGHOWHVWbHQFKVHFFLyQ
7DbOD8WLOL]DFLyQGHOGLVpRVLWLYR
&RPpRQHQWH
3RUFHQWDjHGHXWLOL]DFLyQ
SliceFlipFlops
4inputLUTs
occupieaSlices
bonaeaIOBs
BUFG/BUFGCTRLs
FIFO16/RAMB16s
EMACs
5.2.TestbenchFramer.
$OUHDOL]DUODVLPXODFLyQGHWHFWDPRVTXHODVHxDOenableQR
FXPpOtDFRQODHVpHFLILFDFLyQGHbLGRDTXHHVWDbDUHWUDVDGD
XQFLFORFRQUHVpHFWRDODVHxDOack
3DUD UHFWLILFDU OD LPpOHPHQWDFLyQ VH GHFLGLy LQWURGXFLU
XQD pHTXHxD PRGLILFDFLyQ HQ OD )60 TXH PRGHOL]D HO
FRQWURO GHO )UDPHU FRQYLUWLHQGR OD VHxDO HQDbOH HQ XQD
VHxDOGHVDOLGDGHOWLpR0HDO\GHIRUPDWDOTXHODVHxDOack
PRGLILFD LQVWDQWiQHDPHQWH OD VHxDO HQDbOH 2bVHUYDU HQ OD
)LJODVVHxDOHVack\enable.
5.3.TestbenchdelaFIFOasincrnica.
(Q HVWH WHVWbHQFK VH LQVWDQFLy OD ),)2 DVLQFUyQLFD \ VH
JHQHUDURQ ORV HVWtPXORV VLPXODQGR ODV LQWHUIDFHV GH ORV
PyGXORV$&FRQWUROOHU\)UDPHU
6H GHWHFWy TXH pDUD HO FDVR GH pUXHbD GH ),)2 IXOO HO
$& FRQWUROOHU HVWDbD PDQWHQLHQGR OD VHxDO GH UHVWDUW
DFWLYD GXUDQWH XQ VROR FLFOR GH UHORj QR FXPpOLHQGR OD
HVpHFLILFDFLyQ GHO ),)2 ,3 &RUH JHQHUDWRU GH DO PHQRV
FLFORV GH UHORj &RQVLJXLHQWHPHQWH VH PRGLILFy HO PyGXOR
$&FRQWUROOHUpDUDTXHFXPpODFRQHVWHUHTXHULPLHQWR
6.SINTESIS
6HVLQWHWL]yWRGDODDpOLFDFLyQ\HQOD7DbODVHPXHVWUDOD
XWLOL]DFLyQGHOGLVpRVLWLYR
7. CONCLUSIN
6H GLVHxy FRPpOHWDPHQWH OD DpOLFDFLyQ GH WUDQVIHUHQFLD GH
PXHVWUDV GH DXGLR VRbUH OD UHG XWLOL]DQGR OD PHWRGRORJtD
pURpXHVWD HQ >@ pDUD HO GLVHxR GH VLVWHPDV GLJLWDOHV 'HO
DQiOLVLV GH OD DpOLFDFLyQ VH RbWXYLHURQ GRV PyGXORV
FRPpOHWRV GLVHxDGRV HQ )60' \ XQD LQWHUID] HQWUH HOORV
FRPpXHVWDpRUHOPyGXORTXHFRQWLHQHOD),)2DVLQFUyQLFD
'H OD HVpHFLILFDFLyQ FRPpOHWD GHO VLVWHPD VH pDVy D OD
FRGLILFDFLyQ HQ IRUPDWR +'/ D VX YHULILFDFLyQ \ D ODV
pUXHbDVHQHONLWGHGHVDUUROOR
8.REFERENCES
>@ 3RQJ 3 &KX )3*$ 352727Y3,N* %Y 9(5,/2*
(;$03/(6FDpLWXOR)60'ppJDQ
>@ ,((( ,((( 6WDQGDUG IRU ,QIRUPDWLRQ
WHFKQRORJ\6pHFLILF UHTXLUHPHQWV 3DUW &DUULHU 6HQVH
0XOWLpOH $FFHVV ZLWK &ROOLVLRQ 'HWHFWLRQ &06$&'
$FFHVV0HWKRGDQG3K\VLFDO/D\HU6pHFLILFDWLRQ
>@ 9LUWH[)3*$8VHU*XLGH8*Y$pULO
>@ ;LOLQ[9LUWH[);(YDOXDWLRQ.LW8VHU*XLGH$'6
$YQHW,QF
>@ $XGLR9LGHR 0RGXOH XVHUV JXLGH $'6
$YQHW,QF
>@ 8&% $XGLR FRGHF ZLWK WRXFK VFUHHQ FRQWUROOHU DQG
pRZHUPDQDJHPHQWPRQLWRU5HYJXQH3URGXFW
'DWD
>@ $XGLR &RGHF 5HYLVLRQ 5HYLVLRQ ,QWHO $pULO

>@ 26,5HIHUHQFH0RGHO7KH,620RGHORI$UFKLWHFWXUHIRU
2pHQ 6\VWHPV ,QWHUFRQQHFWLRQ +XbHUW ZLPPHUPDQQ ,(((
7UDQVDFWLRQVRQ&RPPXQLFDWLRQVYROQR$pULO
pp
>@ '3 'V3+Y7(5 ,, 6LQJOH (WKHUQHW
7UDQVFHLYHU 'DWDVKHHW NDWLRQDO 6HPLFRQGXFWRU
&RUpRUDWLRQ
>@ 9LUWH[ )3*$ (PbHGGHG 7UL0RGH (WKHUQHW 0$& 8VHU
*XLGH8*Y)HbUXDU\
>@ 6WHYH .LOWV $GYDQFHG )3*$ 'HVLJQ $UFKLWHFWXUH
,PpOHPHQWDWLRQ DQG 2pWLPL]DWLRQ FDpLWXOR &ORFN
GRPDLQVpp
>@ KWWpZZZPHQWRUFRPpURGXFWVIYPRGHOVLPXpORDGGDWDV
KHHWpGI
18
USE OF SELF-CHECKING LOGIC TO MINIMIZE THE EFFECTS OF SINGLE EVENT
TRANSIENTS IN SPACE APPLICATIONS
Juan Ortega-Ruiz and Eduardo Boemo
School of Computer Engineering,
Universidad Autonoma de Madrid
Ctra. de Colmenar Km 15, 28049 Madrid
email: juan.ortega.ruiz@gmail.com, eduardo.boemo@uam.es
ABSTRACT
The use of Self-Checking circuits has been explored as a
mean to minimize the effects of Single Event Transients in
combinational logic. Different types of circuits have been
described in VHDL RTL, synthesized and simulated at gate
level. During their gate-level simulation faults were injected
and their outputs were checked against a fault-free simula-
tion. Results show that Self-Checking circuits can be used
to either detect or correct errors in combinational logic, min-
imizing the effects of SETs in combinational logic.
1. INTRODUCTION
FPGAs are increasingly utilized in space applications: the
recognized virtues of that technology like high-density, ver-
satility, availability are also attractive in a sector that suffers
a lack of varied components as well as increasingly budget
restrictions. However, the reprogrammability - other dis-
tinctive characteristic of these circuits - can be a drawback
in safe-critical missions. In effect, the intrinsic possibility
of correct errors during design time, can produce an unno-
ticed relax of the strict design and verication rules usually
applied during the development of an masked ASIC [1]
Radiation can affect FPGAs by producing soft or hard
errors. The rst ones are represented by upsets in ip-ops,
latches or SRAM cells (SEU, Single Event Upset), the ac-
tivation of disabled functionality (SEFI, Single Event Func-
tional Interrupt), or the generation of glitches in combina-
tional logic (SET, Single Event Transient). Permanent dam-
age to the device, like degradation by accumulated radiation
(Total Dose) or by particle impacts which produce imme-
diate hardware damage, like latch-up (SEL, Single Event
Latchp) are examples of hard-errors.
This paper deals with the effects of Single Event Tran-
sients (SET) in the user plane of the logic of the FPGA, and
the utilization of self-checking circuits to minimize it. SET
in combinational logic might produced glitches which might
be captured by ip-ops at the end of the logic cone. Such
wrong values would propagate though the logic, with unpre-
dictable functional results.
Probability of capturing a SET increases with clock fre-
quency, because duration of the glitches and the clock period
become similar. Additionally, conventional TMR (Triple
Module Redundant) ip-ops can do little to avoid capturing
such glitches. As a rule-of-thumb, SET becomes a problem
at frequencies, above 150MHz, in space-applied FPGAs.
This paper is organized in the following way. In section
2, a denition of totally self-checking circuits is given. Sec-
tion 3 describes an architecture for a totally self-checking
network and its properties. In section 4 a set of 4 totally
self-checking circuits are evaluated, to study the properties
which make them self-checking. Section 5 summarizes the
conclusions, and nally, section 6 acknowledges this study.
2. TOTALLY SELF-CHECKING CIRCUITS
A circuit whose output is encoded in some error detecting
code is called a self-checking circuit [2]. Self-checking cir-
cuits has properties of Self-Testing and Fault-secure, which
were formally dened by [3, 4]. The following denitions
are extracted from [5, 6, 4].
X is the set of all possible input words.
X
c
is the set of valid input codes. X
c
X
Z is the set of all possible output words
Z
c
is the set of valid output codes. Z
c
Z
F is the set of all possible faults. F is null fault.
A function z : (X, F) Z denes a circuit.
Fault-Secure A circuit is fault-secure for an input set I
X
c
and a fault set F
s
F if the circuit, in the pres-
ence of a fault, produces either the right output code
word or an output non-code word, but never an incor-
rect output code word [6, 4].
19
Fig. 1. Totally self-checking circuit example.
Self-Testing A circuit is self-testing for an input set X
c
, and
a fault set F
t
F if for every fault in the fault set, it
exists an input code word, for which the circuit pro-
duces an output non-code word in the presence of the
fault [6, 4].
Totally Self-Checking a circuit is Totally Self-Checking (TSC)
if the circuit is self-testing for X
c
X, F
t
F, and
the circuit is fault-secure for X
c
X, F
s
F [6, 5].
It is assumed that F
s
F
t
. Fig. 1 shows the relation-
ship between different code and fault spaces in a TSC
circuit.
Code-Disjoint Acircuit is code-disjoint if input code words
are mapped into output code words, and input non-
code words are mapped into output non-code words
[5].
Fig. 1 shows that faults from F
s
produce either the right
output code word, like z(x1, f2), z(x2, f2), or a output non-
code word, like z(x3, f2), but it never produces an incorrect
output code word. However, faults in F
t
might produce in-
correct output code words, like z(x2, f1), but all faults in F
t
can be detected because for every fault in F
t
it exists an in-
put which produces a output non-code word in the presence
of the fault, like z(x1, f1).
Unfortunately, in normal operation it is not guaranteed
that all valid input codes are applied [7], nevertheless, totally
self-checking property guarantees that an output code word
is really a good one, and an output non-code word indicates
the presence of a fault.
3. TOTALLY SELF-CHECKING NETWORKS
Totally self-checking networks are built by totally self-checking
functional circuits, which are monitored by totally self-checking
checkers, whose outputs are coded in some error detecting
code [3]. The code 1-out-of-2 is commonly used because
m-out-of-n codes are able to detect unidirectional multiple
errors. Authors [6, 5] explain m-out-of-n coding theory.
In fault-less operation, the functional circuit transforms
input code words from the input code space X
cf
{0, 1}
n
Fig. 2. Totally self-checking network.
into output code words in the output code space Z
cf

{0, 1}
m
.
The checker circuit transforms input code words from
Z
cf
into code words of the checker output code space Z
cc

{(0, 1), (1, 0)}. The checker constitutes a code-disjoint cir-
cuit. Fig. 2 shows both the functional and the checker cir-
cuits of a TSC network.
By observing the output of the checker, it is possible to
detect any fault in the network, but it is not possible to decide
whether the fault is in the functional circuit or in the checker
itself [5]
4. CASES OF STUDY
The following circuits have been implemented and checked
for the properties of totally self-checking
1. Single bit parity checker
2. Two rail checker
3. 2-out-of-5 code checker.
4. Berger prediction code checker
For each checker the evaluation consisted on the follow-
ing steps:
1. build VHDL RTL model of the checker.
2. synthesize it using a generic target technology. Pro-
duce two implementations: one based in xor gates,
another based in and-or gates.
3. simulate the synthesized gate-level model of the checker
with fault injection. Simulation consists on the fol-
lowing steps:
(a) generate set of all possible faults. Only single
faults will be considered
(b) generate the set of valid input code words.
(c) for every fault of the fault set, simulate the whole
set of valid input code words.
(d) compare results with the fault-free checker sim-
ulation searching for errors.
4. catalog the fault effects
20
x(4:0)
x(1)
x(0)
x(2)
x(3)
x(4)
z(0)
z(1)
z(1:0)
Fig. 3. Single bit parity checker, xor based.
If for a certain input pattern, the fault produces a valid
output code word which is not the expected one, then the
circuit is not fault-secure.
If the fault does not produce any output non-code word
for any input in the set of valid input code words, then the
circuit is not self-testing.
The dependency of the checkers with respect to its syn-
thesis implementation is also evaluated. Both synthesis netlists,
xor and and-or based netlists are simulated with fault in-
jection.
4.1. Single bit parity checker
A single bit parity checker generates a 2-bit vector output
whose value is {10, 01} when the input vector has the
right parity, and {00, 11} when the input vector has the
wrong parity. Two different implementations are proposed.
Fig. 3 shows a xor-based implementation of a simple parity
checker. Fig 4 describes an and-or-based implementation.
1. The input code space is a {1, 0}
5
2. The output code space is 1-out-of-2 code.
3. Fault set consist on inversions, stuck-at-0 and stuck-
at-1 in input bits and gate outputs.
Table 1 shows the results of the simulations with fault
injection.
1. For single errors the circuit is self-testing: there is al-
ways an input pattern which produces an output non-
code word, or an unidirectional error. There are no
undetected faults
2. For single errors, the circuit is fault-secure: no faults
produces unexpected valid output codes, or a bidirec-
tional error. A fault does not convert 10 into 01
and vice versa.
The parity checker is then Totally Self Checking.
Results are identical regardless of the two implementa-
tions proposed.
z(1:0)
z(0)
z(1)
x(1)
x(0)
x(4)
x(3)
x(2)
x(4:0)
Fig. 4. Single bit parity checker, and-or based.
Table 1. Single bit parity checker results.
Fault Loc #Pat
Unidir
Err
Bidir
Err
No
Det
Inv x(*) 16 16 0 0
SA0 x(*) 16 8 0 0
SA1 x(*) 16 8 0 0
Inv U* 16 16 0 0
SA0 U* 16 8 0 0
SA1 U* 16 8 0 0
4.2. Two-rail checker
A two-rail checker compares two complementary set of in-
put vectors of n-bits. Its output is coded in a 1-out-of-2 code.
The output is 01, 10 when the inputs are complemen-
tary, otherwise 11, 00.
Equations 1a and 1b describe a two-rail checker of 2 bits.
z
1
= x
1
y
0
+ x
0
y
1
(1a)
z
0
= x
1
x
0
+ y
1
y
0
(1b)
There are methods to build nbit two-rail checkers de-
scribed in [6, 3]. Fig. 5 shows two architectures for a two-
rail checker of 4 bits.
A 4-bit two rail checker, type tree B, shown in Fig. 6,
has been implemented for the evaluation.
Table 2 shows the results of the simulations with fault
injection.
1. For single errors the circuit is self-testing: there is
always an input pattern which produces an unidirec-
tional error, a non-code word. There are no unde-
tected faults
2. For single errors, the circuit is fault-secure: no faults
produces unexpected valid output codes, or a bidirec-
tional error. A fault does not convert 10 into 01
and vice versa.
The 4-bit two-rail checker is then Totally Self Checking.
21
Fig. 5. A 4-bit dual rail checker.
x(3:0)
y(3:0)
z(1:0)
z(0)
z(1)
y(3)
x(3)
y(2)
x(2)
x(1)
y(1)
x(0)
y(0)
Fig. 6. netlist of the 4-bit dual rail checker tree B.
4.3. 2-out-of-5 code checker
A m-out-of-n code checker checks whether the input word is
a m-out-of-n valid code word. For a m-out-of-n checker to
be self-testing the code words must contain the same number
of 0

s and 1

s, that is, the input code space must be a k-out-


of-2k code [4].
However, general m-out-of-n code checkers, where n =
2m, can be done by translating the code words into k-out-of-
2k code code words by using code-disjoint translators [4, 5].
A k-out-of-2k code checker is built by dividing the input
code word in two parts, which are processed independently
by two circuits, which together generate a 1-out-of-2 output
code space.
The 2-out-of-5 code described in [5] has been evaluated.
The code has been translated into a 3-out-of-6 code, which
is nally checked. Equation 2 denes the translation [5], and
equation 3 denes the checker.
Table 2. 4-bit two-rail checker results.
Fault Loc #Pat
Unidir
Err
Bidir
Err
No
Det
Inv x/y(*) 16 16 0 0
SA0 x/y(*) 16 8 0 0
SA1 x/y(*) 16 8 0 0
Inv U* 16 [12,16] 0 0
SA0 U* 16 [4,8] 0 0
SA1 U* 16 [4,8] 0 0
Table 3. 2-out-of-5 checker results.
Fault Loc #Pat
Unidir
Err
Bidir
Err
No
Det
Inv x(*) 10 10 0 0
SA0 x(*) 10 4 0 0
SA1 x(*) 10 6 0 0
Inv U* 10 [3,10] 0 0
SA0 U* 10 [1,6] 0 0
SA1 U* 10 [1,7] 0 0
y
9
= x
4
x
3
y
8
= x
4
x
2
y
7
= x
4
x
1
y
6
= x
4
x
0
y
5
= x
3
x
2
y
4
= x
3
x
1
y
3
= x
3
x
0
y
2
= x
2
x
1
y
1
= x
2
x
0
y
0
= x
1
x
0
z
5
= y
9
+ y
8
+ y
7
+ y
6
+ y
0
z
4
= y
9
+ y
8
+ y
5
+ y
4
+ y
1
z
3
= y
9
+ y
7
+ y
5
+ y
3
+ y
0
z
2
= y
5
+ y
4
+ y
3
+ y
2
+ y
1
z
1
= y
7
+ y
6
+ y
3
+ y
2
+ y
1
z
0
= y
8
+ y
6
+ y
4
+ y
2
+ y
1
(2)
f
3
= (z
5
+ z
4
+ z
3
)(z
2
z
1
+ z
2
z
0
+ z
1
z
0
) + (z
5
z
4
z
3
)
g
3
= (z
2
z
1
z
0
) + (z
5
z
4
+ z
5
z
3
+ z
4
z
3
)(z
2
+ z
1
+ z
0
)
(3)
Table 3 shows the results of the simulations with fault
injection.
1. For single errors the circuit is self-testing: there is
always an input pattern which produces an unidirec-
tional error, a non-code word. There are no unde-
tected faults
2. For single errors, the circuit is fault-secure: no faults
produces unexpected valid output codes, or a bidirec-
22
Fig. 7. Berger prediction code checker.
tional error. A fault does not convert 10 into 01
and vice versa.
The implemented 2-out-of-5 checker is then Totally Self
Checking.
4.4. Berger Prediction Code Checker
Arithmetic operations can be done in Berger-coded data by
using prediction circuitry [8, 9].
Although the Berger check bits of the result are calcu-
lated with the arithmetic result itself, it is also possible to
predict the result Berger code [8].
Based on these assumptions, a Berger prediction code
checker can be built comparing the output Berger code and
predicted Berger code. Fig. 7 shows the Berger prediction
code checker architecture.
The comparator must be totally self-checking, which can
be implemented by a totally self-checking two-rail checker,
where one the inputs has been inverted.
The code checker fault tolerance is reduced to the fault
tolerance of the TSC two-rail checker. Results of a 4-bit
two-rail checker are summarized in table 2
5. CONCLUSION
Totally self-checking properties of four circuits have been
evaluated with positive results. Their properties make them
suitable to minimize SET effects in combinational circuits,
because the percentage of nets which remain SET sensitive,
with no SET protection, is overall reduced. Nets which are
part of TSC logic are less sensitive to SET, because in pres-
ence of a SET, the TSC logic either recovers, or indicates
an error at the end of the logic cone, which is normally con-
nected to registers. In opposition, normal logic in presence
of SET has no possibility to detect errors at the end of the
cone of logic. The number of nets affected by SETs which
produced undetected errors is then reduced.
Complex code checkers can be made totally self-checking
by using simple and smaller TSC checkers, like two-rail
checker, at a not very expensive cost in terms of area and
speed. However inputs and outputs must be coded in some
error detection code [5].
TSC combinational logic combined with TMR sequen-
tial logic, will increase reliability of SET sensitive systems,
specially when the operation frequency is similar to the ef-
fect of SET, when the probability for a glitch to be captured
in a TMR ip-op increases. In addition, the cases of study
described in this paper, demonstrated that the chosen imple-
mentation, either and-or-based or xor-based, have the same
TSC properties.
6. ACKNOWLEDGES
This work has been granted by the CICYT of Spain under
contract TEC2007-68074-C02-02/MIC
7. REFERENCES
[1] S. Habinc, Lessons learned from fpga developments, Gaisler
Research, Technical Report, September 2002.
[2] W. C. Carter and P. R. Schneider, Design of dynamically
checked computers, in IFIP, vol. 2. North-Holland, 1968,
pp. 878883.
[3] D. A. Anderson, Design of self-checking digital networks us-
ing coding techniques, Computer Science, University of Illi-
nois at Urbana-Champaign, 1971.
[4] D. A. Anderson and G. Metze, Design of totally self-checking
check circuits for m-out-of-n codes, in Proceedings of FTCS,
ser. FTCS, vol. 3, no. 25, IEEE. IEEE, 1995, pp. 244248.
[5] P. K. Lala, Self-Checking and Fault-Tolerant Digital Design,
D. E. M. Penrose, Ed. Morgan Kaufmann, 2001.
[6] J. F. Wakerly, Error Detecting Codes, Self-checking Circuits
and Applications, ser. Computer design and architecture series,
E. J. McCluskey, Ed. Elsevier North-Holland, Inc, 1978.
[7] F. Ozguner, Design of totally self-checking embedded two-
rail code checkers, in Electronic Letters, ser. Electronics Let-
ters, vol. 27, no. 4, IEEE. IEEE, 1991, pp. 382384.
[8] J.-C. Lo, S. Thanawastien, and T. R. N. Rao, An sfs berger
check prediction alu and its applications to self-checking pro-
cessor designs, IEEE Transactions on Computer-Aided De-
sign, vol. 11, no. 4, pp. 525540, April 1992.
[9] J. H. Kim, T. R. N. Rao, and G. L. Feng, The efcient de-
sign of a strongly fault-secure alu using a reduced berger code
for wsi processor array, in International Conference on Wafer
Scale Integration, 1993.
23

24
WIRELESS INTERNET CONFIGURABLE NETWORK MODULE

Mara Isabel Schiavon
Laboratorio de Microelectrnica
FCEIA, UNR
Rosario, Argentina,
bambi@fceia.unr.edu.ar,

Daniel Alberto Crepaldo
Laboratorio de Microelectrnica
FCEIA, UNR
Rosario, Argentina,
crepaldo@fceia.unr.edu.ar
Raul Lisandro Martin
Laboratorio de Microelectrnica
FCEIA, UNR
Rosario, Argentina,
rlmartin@fceia.unr.edu.ar



Abstract A field programmable logic devices (FPGA) wire-
less Internet configurable network node was developed. Inter-
nally, three modules can be differenced, an 802.11 compatible
transmitter/receiver, an ETHERNET compatible dedicated
communication module and other module for receiving field
sensors activity signals. Implementation over SPARTAN III
developing board results is presented with simulation results.
Keywords: wireless; reconfigurable network node; fpga
I. INTRODUCTION
A wireless Internet configurable network node imple-
mented with field programmable logic devices (FPGA) is
presented. The system has reconfiguration capability
achieved by the Internet connection in a wireless
ETHERNET local area network. [1] [2]
Each one of the net nodes contains three modules,
a sensor subsystem that receives activity signals from field
sensors and generates the data telegram, a dedicated
communication module that builds the message frame
to transmit and decodes the incoming frames, and
another to manager wireless data interchange identified as
TRANSMITTER/RECEIVER.
Node operation and configuration data are interchanged
via a wireless ETHERNET network and is achieved re-
motely using IP protocol [3]. The data interchange protocol
for wireless Ethernet networks with connectivity to
INTERNET are defined in IEEE 802.11 standard rules.
These rules are technology and internal structure independ-
ent. Rules minimum and necessary subset was selected.
Each network node has a physical address (MAC) and an IP
address.
II. DESCRIPTION
Block diagram of a typical net node is shown in Figure 1.
It shows six blocks. First block is a wireless ETHERNET com-
patible transmitter/receiver. Next three blocks constitute the
dedicated communication module: ETHERNET frames encod-
ing and decoding block (CODE/DECO ETHERNET), IP packets
encoding and decoding block (CODE/DECO IP) and a memory
block (ETHERNET DATA MEMORY). The last two blocks corre-
spond to the sensor subsystem, one (SENSOR MANAGER) to
receive data from sensors and to conform them for transmis-
sion and the other is a memory block to store configuration
parameters (CONFIGURATION MEMORY).

Figure 1. System node block diagram
A. TRANSMITTER/RECEIVER
The transmitter/receiver to be used in this application will
be a wireless ETHERNET IEEE 802.11 compatible transmit-
ter/receiver and its description runs out of the scope of this
paper
B. DEDICATED COMMUNICATION MODULE
ETHERNET frames encoding and decoding block
(CODE/DECO ETHERNET) is a bidirectional block to manage
data transmission and reception. As a receptor, it recog-
nizes, decodes and processes the incoming frame according
to ETHERNET rules. In data transmission, the reverse proc-
ess is managed. The internal block diagram is shown in fig-
ure 2.
Analyzer block (AT) block is designed as a finite state
machine with an initial state to select transmission or recep-
tion process. Each one of those process is accomplished by
its own state chain.
In the transmission process, when PACKET OUT OK signal
from the CODE/DECO IP is detected by AT block process to
conform the output message is started assembling memory
stored data telegram, destination/origin MAC and control bits.
If channel is busy the transmission is inhibited. When the
transmission medium is free, a signal (CCA INDICATION) is
generated by the transmitter. After a random backoff time to
prevent possible collisions the transmission is enabled.

T TR RA AN NS SM MI IT TT TE ER R
/ /R RE EC CE EI IV VE ER R
C CO OD DE E/ /D DE EC CO O
E ET TH HE ER RN NE ET T
C CO OM MU UN NI IC CA AT TI IO ON N
M ME EM MO OR RY Y
C CO OD DE E/ /D DE EC CO O I IP P
C CO ON NF FI IG GU UR RA AT TI IO ON N
M ME EM MO OR RY Y
S SE EN NS SO OR RS S
S SE EN NS SO OR R
M MA AN NA AG GE ER R
25

Figure 2. Code/deco ethernet internal block diagram
If channel is free, AT block generates a signal
(TXSTARTREQUEST) to require starting transmission. If
transmitter activates the TXSTARTCONFIRM signal, AT block
sends the first data octet and sets DATAREQUEST signal. It
waits until DATACONFIRM signal is set to indicate octet recep-
tion to send next octet. When the complete frame is transmit-
ted, AT block activates the TXENDREQUEST signal to stop
transmission. Transmitter responds activating the TXENDCON-
FIRM signal.
In the reception process, when the transmitter/receiver
detects a valid frame is starting, RXSTARTINDICATION signal
is set to activate AT block in reception mode. Presence of a
valid data octet at the output of the transmitter is indicated by
the DATAINDICATION signal. RXENDINDICATION signal is set
when the complete frame was received.
Data are received by AT block and CRC CHECKER block.
Data are processed by AT block and stored in the
COMMUNICATION MEMORY. Simultaneously, CRC CHECKER
block checks redundancies through a feedback shift register
like is proposed in XILINX application notes. [4]
When CRC is validated stored data packet are ready to be
used by the CODE/DECO IP. Any other situation stops the
process and the data are rejected.
COMMUNICATION MEMORY was implemented in one of
the block RAM with two read/write ports available in
SPARTAN III devices. To be used in message answer construc-
tion, origin and destination MAC address are stored.
When corresponding fields from the ETHERNET incoming
frame are loaded in COMMUNICATION MEMORY by
CODE/DECO ETHERNET block, signal PACKET OK validates
data presence, and IP protocol encoder/decoder block
(CODE/DECO IP block) manages the IP protocol and extracts
the CONFIGURATION DATA telegrams.
IP protocol encoder/decoder block is designed as a finite
state machine, with an initial waiting state and one state for
each field of the received packet. If the state corresponds to a
transmission control information field, data are verified and
when they are validated the system change to the next state.
Otherwise packet is discarded, and the system returns to the
initial state.
If the corresponding IP address is detected, decoding
process is accomplished and data are stored in sensor
subsistem memory (CONFIGURATION MEMORY).
When sensor unit (SENSOR MANAGER) generates a data
telegram, a valid data signal is activated and IP block starts
the transmission procedure storing this data in the
COMMUNICATION MEMORY and setting PACKET OK signal.
III. RESULTS
A. Simulation results
Simulation results for an incoming appointed ETHERNET
frame is shown in figure 3. Once an ETHERNET frame is de-
tected by the transmitter, RXSTARTINDICATION signal turns to
high, and DATAINDICATION signal validates the incoming
bytes. When destination MAC address is recognized, DIR OK
signal turns to high, and when IBSS identifier is verified,
BSSOK signal turns to high. As data come in, duration field,
destination address, and data bytes are stored (see DATAAIN
signal). CRC is checked and FRAME OK signal turns to high to
validate the frame.

(a) First 16 bytes

(b) Rest of frame
Figure 3. Reception of an appointed frame
IP packet decoding simulation results are shown in figure
4. Packet data stored on communication memory are read
and decoding process is started by the CODE/DECO IP stage
(see ADDRB_IN and DATAB_IN signals). Once this process is
completed, data are stored in the configuration memory (see
ADDRA_OUT and DATAA_OUT signals).
F FR RO OM M / / T TO O
C CO OD DE E / /D DE EC CO O I IP P
F FR RO OM M / / T TO O
M ME EM MO OR RY Y
C CR RC C
D DA AT TA A
C CR RC C 3 32 2

F FR RO OM M / / T TO O
T TR RA AN NS SM MI IT TT TE ER R

C CR RC C C CH HE EC CK KE ER R






A AT T
C CC CA A I IN ND DI IC CA AT TI IO ON N
T TX XS ST TA AR RT TR RE EQ QU UE ES ST T
T TX XS ST TA AR RT TC CO ON NF FI IR RM M
D DA AT TA AC CO ON NF FI IR RM M
D DA AT TA AR RE EQ QU UE ES ST T
T TX XE EN ND DR RE EQ QU UE ES ST T
T TX XE EN ND DC CO ON NF FI IR RM M
R RX XS ST TA AR RT TI IN ND DI IC CA AT TI IO ON N
D DA AT TA AI IN ND DI IC CA AT TI IO ON N
R RX XE EN ND DI IN ND DI IC CA AT TI IO ON N
26

Figure 4. IP packet decoding

Figure 5. Sensor data telegram generation
When an activation signal from one sensor is detected,
SENSOR1 signal is activated and the corresponding telegram
generation is started. FREE signal indicates that channel is
clear, and telegram bytes are transmitted through DATA and
VALID DATA signals as it is shown in figure 5.
Simulation results for the transmission of a test frame are
shown in figure 6.
Once the transmission request from the CODE/DECO IP is
detected by AT block, the output message is assembled.
When the transmission medium is free (signal CCA
INDICATION set) signal TXSTARTREQUEST from AT block turns
to high to require starting transmission. Transmitter activates
the TXSTARTCONFIRM signal, and AT block sends the first
data octet setting the DATAREQUEST signal When
DATACONFIRM signal is set to indicate octet reception, next
octet is sent. When the complete frame is transmitted, AT
block activates the TXENDREQUEST signal to stop transmis-
sion. Transmitter responds activating the TXENDCONFIRM
signal.

(a) Start of transmission

(b) End of transmission
Figure 6. Transmission of a test frame
B. Hardware Results
Prototype was implemented using Digilent S3 SKB de-
velopment boards for SPARTAN 3 devices [5]. Designed node
occupies 124 from a total of 1920 slices (about 6 % of the
FPGA total capacity).
IV. CONCLUSIONS
SPARTAN 3 design and implementation of a configur-
able via INTERNET domotic network was presented. Mean-
ingful simulation results using XILINX ISE platform simula-
tion software are shown. Simulation results were validated
with successfully communication tests done over prototype
implemented in Digilent S3 SKB development boards.
V. REFERENCES
[1] Schiavon M. I., Crepaldo D., Martn R. L., Varela C. Dedicated
system configurable via Internet embedded communication
manager module, V Southern Conference on Programmable
Logic, San Carlos, Brasil (2009) pp 193-197.
[2] IEEE, IEEE STD 802.11-2007, Revision of IEEE STD 802.11-
1999,. June 2007.
[3] Waisbrot, J. Request For Comments: 791, http://www.rfc-
es.org/rfc/rfc0826-es.txt
[4] Borrelli C. IEEE 802.3 cycle redundancy check, XILINX,
App. Note XAPP209. March, 2001.
[5] Digilent S3 SKB development boards, SPARTAN 3 FPGA, and ISE
platform , http://www.xilinx.com

27

28

MIC A NEW COMPRESSION METHOD OF INSTRUCTIONS IN HARDWARE FOR
EMBEDDED SYSTEMS
Wanderson R. A. Dias, Raimundo da S. Barreto
Department of Computer Science - DCC
Federal University of Amazonas - UFAM
wradias@gmail.com, xbarretox@gmail.com

Edward David Moreno
Department of Computer Science - DCOMP
Federal University of Sergipe - UFS
edwdavid@gmail.com

ABSTRACT
Several factors are considered in the development of
embedded systems, among which may be mentioned: physical
size, weight, mobility, energy, memory, freshness, safety, all
combined with a low cost and way of use. There are several
techniques to optimize the execution time and power
consumption in embedded systems. One such technique is the
compression code, the majority of existing proposals focus on
decompression assuming the code is compressed in time
compilation. This article proposes the development of a new
method of compression/decompression code implemented in
VHDL and prototyped on an FPGA, called MIC (Middle
Instruction Compression). The proposed method was
compared with the traditional method Huffman also
implemented in hardware. The MIC showed better
performance compared with Huffman for some programs
MiBench, widely used in embedded systems, 71% increase in
clock frequency (in MHz) and 36% more in compression
codes compared with the method of Huffman, and allows the
compression and decompression at runtime.
1. INTRODUCTION
Embedded systems are any systems digital are incorporated
into other systems in order to add or optimize features [16].
Embedded systems have the task to monitor and/or control the
environment in which it is inserted. These environments may
be present in electronic devices, appliances, vehicles,
machinery, engines and many other applications.
The growing demand for the use of embedded systems has
become increasingly common, prompting the implementation
of complex systems on a single chip, called System-on-Chip
(SoC). In this case, the embedded processor is a key
component of embedded computer systems [4]. Today, many
embedded processors found in the market are based on
architectures of high-performance (e.g., RISC architectures of
32-bit) to ensure a better computational performance for the
tasks to be performed. Therefore, the design of embedded
systems for high-performance processors is not a simple task.
It is known that many embedded systems are powered by
batteries. For this reason, it is critical that these systems are
able to control and manage power, thus enabling a reduction in
energy consumption and control of heating. Therefore,
designers and researchers focused on developing techniques
that reduce energy consumption while maintaining
performance requirements. One such technique is the
compression of the code of instructions in memory.
Most of the techniques, methodologies and standards for
software development, for the control and management of
energy consumption, do not seem feasible for development of
embedded systems because they possess several limitations of
computing resources and physical. Current strategies designed
to control and manage energy consumption have been
developed for general-purpose systems, where the cost of
additional processors or memory are usually insignificant.
The code size increases significantly as the systems
become more heterogeneous and complex. In this sense, there
was a high technical level that seeks to compress the code at
compile time and their relief, in turn, is made at run time [12,
13, and 14].
The compression technique was developed in order to
reduce the size code [15]. But over time, groups of researchers
found that this technique could be of great benefit to the
performance and energy consumption in general-purpose
systems and embedded systems. Once the code is compressed
in memory is possible on each request processor, get a much
larger amount of instructions contained in memory. So is there
a decrease in the activities of transition pins memory access,
leading to a possible increase in system performance and a
possible reduction in energy consumption of the circuit [15].
Likewise, when storing compressed instructions in the
cache increases the number of instructions stored in the cache
and increases your hit rate (hit rate), reducing search in main
memory, increasing system performance and therefore,
reducing energy consumption.
This article presents the development of a new method of
compressing and decompressing instructions (at runtime),
which was implemented in VHDL (Very Hardware
29

Description Language) [5] and prototyped in an FPGA (Field
Programmable Gate Array) [3], called MIC (Middle
Instruction Compression), which was compared with the
traditional method of Huffman also implemented in hardware,
and was shown to be more efficient than the method of
Huffman from a comparison using the benchmark MiBench
[7].
The rest of the paper is organized as follows: Section 2
presents the related work, Section 3 explains the architecture
PDCCM developed for the MIC method, Section 4 details the
description of the method MIC; Section 5 shows the
simulations with benchmark MiBench using MIC methods and
Huffman finally, Section 6 presents conclusions and ideas for
future work.
2. RELATED WORK
This section lists some researches founded in the literature
related to compressed instruction codes.
WOLFE & CHANIN [17] developed the CCRP
(Compressed Code RISC Processor), which was the first
hardware decompression implemented in a RISC processor
(MIPS R2000) and was also the first technique to use the
failures of access to the cache mechanism to trigger the
decompression.
The CCRP has similar architecture to the standard RISC
processor and thus the models of the programs are unchanged.
This implies that all existing development tools for RISC
architecture, including compilers optimized, functional
simulators, graphics libraries and others, also serve to
architecture CCRP. The unit of compression used is the cache
line of instructions. Every failure of access to the cache, the
instructions are fetched from main memory, uncompressed and
feed the cache line where there was a failure [17]. The fact that
the CCRP has to perform decompression of the instructions
before storing them in cache is advantageous in that the
addresses contained in the cache jumps are the same as the
original code. This solves most problems of addressing; there
is no need to resort to gimmicks such as (I) put extra hardware
in the processor for different treatment of jumps, and (II) make
patches address jump.
The technique used the CCRP Huffman coding [8]
generated by a histogram of occurrences of bytes of program
and showed a compression ratio of 73% on average for the
tested package (consisting of the programs nasa1, nasa7,
tomcatv, matrix25A, espresso, fpppp and others). For memory
models slower DRAM (Dynamic Random Access Memory),
processor performance was mostly mildly improved. For
models faster memory EPROM (Erasable Programmable Read
Only Memory), performance suffered a slight degradation.
AZEVEDO [2] proposed a method called IBC (Instruction
Based Compression), which is to perform the division of
instruction set processor classes, taking into account the
number of events along with number of elements in each class.
Research by AZEVEDO [2] showed better results in
compression of 4 classes of instructions. The compression
technique developed is to group pairs in the format [prefix
codeword] that replace the original code. In pairs formed, the
prefix indicates the class of instruction and serves as a
codeword index for the table of instructions.
The process of decompression is performed in 4 pipeline
stages. The first stage is called INPUT where the address is
converted processor (code uncompressed) in the main memory
address. The second stage is called FETCH, which is
responsible for the search word in the compressed main
memory. The third stage is known as DECODE where it is
really held the decoding of codewords. And finally the fourth
stage, called the output, the query is performed in the
dictionary of instruction to be provided to the processor
instruction. In tests, AZEVEDO [2] obtained a compression
ratio of 53.6% for the MIPS processor and 61.4% for the
SPARC (Scalable Processor Architecture). The performance,
there was a loss of 5.89% using the method IBC.
BENINI et al [4] developed a compression algorithm that
is suitable for efficient implementation of the hardware
(decompressor). The instructions are packaged in groups that
are the size of a cache line and its decompression occurs at the
moment that is extracted from the cache. The experiments
were performed with the DLX processor, due to even have a
simple architecture of 32 bits and also be a RISC architecture.
In addition, the DLX processor is similar to several
commercial processors family ARM [1] and MIPS. A table of
256 positions was used to store the instructions executed more.
Each cache line consists of 4 original instructions or a set of
instructions compressed and possibly interspersed with non-
compressed, prefixed by a word of 32 bits. The word is not
compressed in a fixed position from the cache and serves to
differentiate a cache line with instructions from the other lines
compressed with the original instructions. Indeed, a
compressed cache line does not necessarily contain all the
instructions compressed, but should always be a number
between 5 and 12 instructions in the compressed cache line to
be advantageous the use of compression [4].
To avoid the use tables of address translation, BENINI et
al, require that the destination addresses are always aligned to
32 bits (word). The first word (32 bits) of the cache line
contains an L mark and a set bits of flag. The brand is an
opcode instruction is not used, ie, an opcode that signals a
compressed line (in the DLX processor opcodes are 6 bits).
The compression algorithm developed by BENINI et al [4]
analyzes the code sequentially from the first instruction
(assuming that each cache line is already aligned) and tries to
pack instructions in adjacent rows compressed. The
experiments carried out in several packages of the benchmark
C code provided by Ptolemy project [6] proved that there was
an average reduction in code size by 28% and an average
savings in energy consumption by 30%.
LEKATSAS et al [10, 11], developed a decompression
unit with a single cycle. Decompression can be applied to
instructions of any size of a RISC processor (16, 24 or 32 bits).
The only specific application is part of the interfacing between
the processor and memory (main or cache). The
decompression mechanism is capable of decompressing one or
two instructions per cycle to meet the demand of the CPU
without increasing the runtime. They developed a technique to
create a dictionary that contains the instructions that appear
more frequently. The dictionary code refers to a class of
30

compression methods replacing sequences of symbols with the
contents of a table. This table is called a dictionary and the
contents are codewords in the compressed program [11].
The main advantage of this technique is that the rates are
usually fixed-length, and thus simplifies the logic of
decompression to access the dictionary and also reduces the
latency of the decompression. The results obtained in tests
carried out showed that there was an average gain of 25%
performance in execution time of applications using the
compression code and an average of 35% reduction in code
size. The technology developed is not limited to only one
processor, but can be applied and achieve similar results on
other processors.
LEFURGY et al [9] proposed a compression technique
based on the code of the program code using a code dictionary.
Thus, compression is performed after compiling the source
code, however, the object code is analyzed and the common
sequences of instructions are replaced by a coded word
(codeword), as in text compression. Only the most frequent
instructions are compressed. A bit (bit escape) is used to
distinguish one word compressed (encoded) of an
uncompressed instruction. The instructions corresponding to
the compressed instructions are stored in a dictionary in the
decompression hardware. The compressed instructions are
used to index the dictionary entries. The final code consists of
codewords mixed with uncompressed instructions.
It is observed that one of the most common problems
found in the compression code refers to the determination of
the target addresses of jump instructions. Usually this type of
instruction (direct diversion) is not coded to avoid the need to
rewrite the code words that represent these instructions [8].
Since deviations overhead can be encoded normally, because,
as their target addresses are stored in registers, only the code
words need to be rewritten. In this case, you need only one
table to map the addresses stored in the original registrar for
the new addresses tablets.
This method differs from other methods seen in literature
that addresses the goals are always aligned to 4 bits (size of a
codeword), not the size of the word processor (32 bit). As
advantage it seems a better compression, but the disadvantage
there is a need for changes in the core processor (extra
hardware) to address gaps to address aligned to 4 bits.
However, it is unclear details about the interaction of hardware
decompression with experienced processors (PowerPC, ARM
and i386). The operation of hardware decompress or is done
basically as follows: The instruction is fetched from memory,
if a codeword, the decoding logic of specific codeword gets
the offset and its size will serve as an index to access the
uncompressed instruction in the dictionary and pass processor.
If instructions are not compressed, they are passed directly to
the processor. With the method proposed in [9], were obtained
compression ratios of 61% for the PowerPC processor, 66%
for the ARM processor and 75% for the i386 processor. The
metrics of performance and power consumption were not
expressed.
3. PDCCM ARCHITECTURE AND MIC
METHOD
In the literature we have found two basic types of
architectures, code compression, CDM and PDC, which
indicate the position of the decompressor for the processor and
memory subsystem, as shown in Figure 1. The CDM
architecture (Cache Memory Decompressor) indicates that the
decompressor is positioned between the cache and main
memory, while the architecture PDC (Processor Cache
Decompressor) places the decompressor between the processor
and cache.
Fig. 1. Architectures decompression code:
(a) CDM e (b) PDC [12].

As previously mentioned (Section 2), the development of
architectures for compression or decompression code
instruction is done separately, in most of the work of the treaty
only because the decompressor hardware compression of the
instructions is usually done through changes in the compiler.
Thus, the compression is performed at compile time and
decompression is done at run time using a specific hardware
decompression.
To operate the MIC method proposed in this work, it was
necessary to develop a new architecture, hardware, to carry out
the compression and decompression of the instruction code at
runtime. The architecture was created titled PDCCM Processor
(Compressor Decompressor Cache Memory) in which it is
shown that hardware compression was inserted between the
cache and main memory and hardware decompression was
inserted between the processor and memory cache. PDCCM
The architecture was implemented in VHDL and prototyped
on an FPGA manufacturer ALTERA

[18].
The architecture works with instructions PDCCM size of
32 bits, that is, each line of instruction cache consists of 4
bytes. Thus, the architecture developed is compatible with
systems using the ARM processor as the core of your
embedded system, because this processor features a set of
instructions 32 bits. PDCCM In architecture, using the method
of compression / decompression MIC all instructions that are
recorded in the instruction cache will suffer a 50%
compression in its original size.
Figure 2 shows the architecture PDCCM developed to
implement the new method of compressing and decompressing
instructions in hardware (MIC), which consists of four basic
components, and they are:
31

LAT (Line Address Table): is a table that has the
function to map the addresses of the instructions with
your new address in the instruction cache;
ST (Sign Table): is a table that contains bits that serve as
flags to indicate to the decompressor which pair of bits
should be reconstituted, uncompressed;
Compressor: is to enforce the compression of all
instruction codes that will be saved in the instruction
cache. The compressor is started every time the RAM
has an access and a new instruction is passed on to be
saved in the instruction cache;
Descompressor: is to enforce the decompression of all
instructions that are stored in the instruction cache and
will be passed to the processor. The decompressor is
triggered every time the LAT is found and return a hit.
Fig. 2. PDCCM Architecture.
4. A COMPRESSION ALGORITHM: MIC
METHOD
The MIC method (Middle Instruction Compression) is a
compression method which is to reduce by 50% the size of
instruction codes that are stored in the instruction cache, and
then passing the length of the 32 bit instructions (original size)
to 16 bits (compressed size).
The MIC method requires an additional memory
components used by the ST and LAT, which store the set of
flags of the compressed instruction and mapping of the new
addresses of the compressed instructions in the cache,
respectively.
For compression, each instruction is read into memory and
saved to the instruction cache will be split into pairs of bits
with each pair consisting of: 00, 01, 10 and 11. Compressor
MIC performs the following logic: pairs with equal values are
replaced by bit 0 (zero) and pairs with different values are
replaced by bit 1 (one). Then the bits 00 and 11 in compression
are replaced by bit 0 and bits 01 and 10 are replaced by bit 1.
So, a couple of bits are reduced to a single bit.
An auxiliary table (ST) is used to store the set of flags of
double bit compressed. Pairs of bits that start with the value 0
(zero), such as 00 or 10 is recorded in the ST bit 0 and bit pairs
that start with the value 1 (one), such as 11 or 01, is recorded
in the ST bit 1. It is noteworthy that the mode of address lines
of instruction for this architecture is Big-Endian.
For further clarification and, where possible, the names of
components, variables, and input pins and output (input and
output) are similar to those used in the code implemented in
VHDL.
4.1. Compression/Decompression Process
The processor requests an instruction from the instruction cache through a pin that, in this implementation, is called end_inst_proc (the current PC). The LAT is checked to see whether it holds the address provided by the processor. If the instruction is found in the instruction cache, the LAT signals a HIT and provides the new address of the instruction in the instruction cache, the address of the instruction's set of flags in the ST, and the 16-bit half (first or second) of the line where the compressed instruction and its flags reside in the instruction cache and in the ST, respectively. All this information is passed to the decompressor, which reconstructs the instruction and returns it in uncompressed form to the processor through the variable returnD_inst_proc.
The decompression of instruction codes is performed as follows (a software sketch of the reconstruction rule is given after this list):
The new address of the instruction passed by the LAT is used to locate the entry in the instruction cache and in the ST;
The instruction cache and the ST return to the decompressor the 16-bit compressed instruction and its 16-bit set of flags;
If a bit read from the compressed instruction in the instruction cache is 0 (zero), the pair of bits to be reconstructed is 00 or 11. The flag bit decides which: if the flag bit is 0 the pair is 00, and if the flag bit is 1 the pair is 11;
If the bit read from the compressed instruction in the instruction cache is 1 (one), the pair of bits to be reconstructed is 10 or 01. Again the flag bit decides: if the flag bit is 0 the pair is 10, and if the flag bit is 1 the pair is 01;
For each instruction to be decompressed, the 16-bit compressed instruction saved in the instruction cache is analyzed in this way, transforming the compressed 16-bit instruction back into the uncompressed 32-bit instruction.
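The sketch below is the software inverse of the compression model shown earlier, reconstructing a 32-bit instruction from its 16-bit compressed form and its 16-bit set of flags; again, the names are illustrative only and are not those of the VHDL decompressor.

#include <stdint.h>

/* Illustrative inverse of the compression sketch: rebuilds each pair from
   one compressed bit and one flag bit, following the rules listed above. */
uint32_t mic_decompress(uint16_t compressed, uint16_t flags)
{
    uint32_t instr = 0;
    for (int i = 15; i >= 0; i--) {          /* bit 15 holds the first pair */
        unsigned cbit = (compressed >> i) & 1u;
        unsigned fbit = (flags >> i) & 1u;
        unsigned lo   = fbit;                /* flag 0 -> 00 or 10, flag 1 -> 11 or 01 */
        unsigned hi   = cbit ^ fbit;         /* cbit 0 -> 00/11, cbit 1 -> 10/01       */
        instr = (instr << 2) | (hi << 1) | lo;
    }
    return instr;
}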

Now, if the address provided by the processor is not in the LAT, the instruction is not in the instruction cache and the LAT signals a MISS. The address provided by the processor is then passed to the RAM (Random Access Memory), where it is looked up. If the lookup in the RAM also misses, the instruction is fetched from the HD (hard disk). If the lookup hits, the instruction is in RAM; the RAM then returns one copy of the instruction in its original (uncompressed) format to the processor, through the variable returnC_inst_proc, and another copy to the compressor, which carries out the whole compression process.
The compression of instruction codes is performed as follows:
The instruction is read from RAM and one copy is passed to the processor and another to the compressor;


The instruction in the compressor is split into 16 pairs of bits; each pair is formed at the moment it is read by the compression function, starting from the MSB (most significant bit) of the instruction coming from the RAM;
The compressor always considers in which 16-bit half (first or second) of the line the compressed instruction and its flags must be saved in the instruction cache and in the ST;
If the pair of bits read for compression is 00 or 11, it is replaced by bit 0 and saved in the instruction cache; if the pair of bits read is 10 or 01, it is replaced by bit 1 and saved in the instruction cache;
The set of flags in the ST is formed through the logic illustrated earlier: pairs such as 00 and 10 produce flag bit 0, and pairs such as 11 and 01 produce flag bit 1;
After the compressor has replaced the entire original 32-bit instruction by its 16-bit compressed form and its flag bits, it saves the compressed half-word (first or second) in the instruction cache and the instruction's set of flags in the ST;
The LAT table is updated with the new address of the instruction saved in the instruction cache and in the ST;
This compression process is repeated for each instruction fetched from memory.

It is important to highlight that, in our approach, these compression and decompression mechanisms are performed at runtime by specialized hardware prototyped on FPGAs. The PDCCM architecture has a small performance loss due to the additional cycle in the pipeline. In our hardware implementation we found that this result is similar to that obtained by LEKATSAS in [10, 11], since we obtained components that need only a single cycle for compression or decompression; the benefits are shown in the next section.
5. SIMULATIONS WITH MIBENCH
The benchmarks used in the simulations of the MIC and Huffman compression and decompression methods come from the MiBench package [7], which targets embedded systems and covers different categories; the programs were used as ARM9 assembly code, as found in [19, 20, 21, 22]. The MiBench programs used in the simulations are CRC32, JPEG, QuickSort and SHA.
We used the instruction set of the embedded ARM processor (ARM9 family, version ARM922T, ARMv4T ISA) to simulate the operation of the MIC and Huffman compressors and decompressors in the PDCCM architecture. The chosen processor (ARM) is a RISC processor with a 32-bit instruction set, which makes it a good platform on which to simulate the PDCCM architecture.
For the simulations of the MIC and Huffman compression and decompression methods, the first 256 instructions of each MiBench program were selected (due to physical limitations of the FPGA used for prototyping), obtained from the code compiled (assembly) for the embedded ARM processor; they form the sequences of instructions used to load a stretch of RAM and the instruction cache. For more details, see [23].
The stretch of RAM described in VHDL was used in all simulations with the MiBench benchmarks and had a fixed size of 256 lines of 4 bytes each (modeling a 1-Kbyte memory), totaling 8,192 bits, while the instruction cache has 32 lines of 32 bits each (totaling a 1-Kbit instruction cache). There is therefore an 8:1 ratio between the sizes of the RAM and the instruction cache.
Table 1 shows the average timing of the PDCCM architecture using both the MIC and the Huffman methods for compression and decompression of the instructions of some MiBench programs.

Table 1. Delay in FPGA.
                              MIC          Huffman
Compression
  Time in the worst case      9.314 ns     9.849 ns
  Clock frequency             33.52 MHz    13.16 MHz
  Clock period                30.398 ns    76.020 ns
Decompression
  Time in the worst case      9.234 ns     11.006 ns
  Clock frequency             30.92 MHz    5.52 MHz
  Clock period                32.554 ns    184.606 ns

As Table 1 shows, the MIC method presented better timing in the FPGA for all the MiBench benchmarks analyzed. In compression, there is a difference of more than 60% in clock frequency (in MHz), while the worst-case times of the two methods are very similar. In decompression the difference is even larger: more than 82% in clock frequency (in MHz).
Based on the first 256 instructions of the MiBench benchmarks, obtained from the assembly code compiled for the ARM platform, Table 2 shows that the MIC method compressed the instructions by 50%: the 256 lines of the stretch of RAM used in the simulation occupied only 128 lines in the instruction cache after compression. The instructions compressed with the Huffman method achieved an overall average compression of 32% with respect to the size of the RAM used in the simulation.

Table 2. Comparison of compression rates (256 instructions).
MiBench       MIC          Huffman
CRC32         128 (50%)    159 (38%)
JPEG          128 (50%)    181 (29%)
QuickSort     128 (50%)    192 (25%)
SHA           128 (50%)    164 (36%)
Averages      128 (50%)    174 (32%)

Based on these results, we find that for the PDCCM architecture, using the first 256 instructions of the MiBench benchmarks (CRC32, JPEG, QuickSort and SHA), the MIC method was more efficient in the compression phase, with a compression rate 36% higher than that of the Huffman method.
6. CONCLUSIONS AND FUTURE WORKS
This paper presented a study of research on the compression/decompression of instruction code and of the architectures CDM (Cache Decompressor Memory), which places the decompressor between the cache and main memory, and PDC (Processor Decompressor Cache), which places the decompressor between the processor and the cache.
The article described a new compression method, called MIC, which was prototyped on FPGAs and proved to be feasible for embedded systems based on RISC architectures. In the future this technique may become a standard component of embedded systems projects. With the use of code compression techniques, RISC architectures can minimize one of their biggest drawbacks, the amount of memory needed to store programs.
The simulations carried out with some MiBench benchmark programs showed that the MIC method achieved an operating frequency (MHz) approximately 3 times higher for the compression/decompression of instruction codes and a compression rate 36% better for the MiBench programs analyzed, compared with the Huffman method, which was also prototyped in hardware.
Therefore, analyzing the data obtained through the simulations, it can be concluded that the method developed and presented in this paper, called MIC, is computationally more efficient than the Huffman method implemented in hardware. The simulations used the CRC32, JPEG, QuickSort and SHA programs of the MiBench benchmark for the performance measurements.
Future work includes: designing and implementing a RISC processor that has the compressor and decompressor hardware built into its core; testing the MIC and Huffman compression and decompression methods with more MiBench benchmark programs; and reaching an ASIC implementation, so that this project goes beyond the academic realm and also serves the industrial sector.
7. REFERENCES
[1] ARM. An Introduction to Thumb. Advanced RISC Machines
Ltd., March 1995.
[2] AZEVEDO, R. An Architecture for Code Compression in Dedicated
Systems. PhD thesis, IC, UNICAMP, Brazil, June 2002.
[3] COSTA, C. da. Designing Digital Controllers with FPGA.
São Paulo: Novatec Publisher, 2006, 159p.
[4] BENINI, L.; MACII, A.; NANNARELLI, A. Cached-Code
Compression for Energy Minimization in Embedded Processor.
Proc. of ISPLED'01, pages 322-327, August 2001.
[5] D'AMORE, R. VHDL - Description and Synthesis of Digital
Circuits. Rio de Janeiro: LTC, 2005, 259p.
[6] DAVIS II, J.; GOEL, M.; HYLANDS, C.; KIENHUIS, B.;
LEE, E. A.; LIU, J.; LIU, X.; MULIADI, L.;
NEUENDORFFER, S.; REEKIE, J.; SMYTH, N.; TSAY, J.;
XIONG, Y. Overview of the Ptolemy Project, ERL Technical
Memorandum UCB/ERL Tech. Report N M-99/37, Dept.
EECS, University of California, Berkeley, July 1999.
[7] GUTHAUS, M.; RINGENBERG, J.; ERNST, D.; AUSTIN,
T.; MUDGE, T.; BROWN, R. MiBench: A Free, Commercially
Representative Embedded Benchmark Suite. In Proc. of the
IEEE 4th Annual Workshop on Workload Characterization,
pages 3-14, December 2001.
[8] HUFFMAN, D. A. A Method for the Construction of
Minimum-Redundancy Codes. Proceedings of the IRE,
40(9):1098-1101, September 1952.
[9] LEFURGY, C.; BIRD, P.; CHEN, I-C.; MUDGE, T.
Improving Code Density Using Compression Techniques. In
Proc. Int'l Symposium on Microarchitecture, pages 194-203,
December 1997.
[10] LEKATSAS, H.; HENKEL, J.; JAKKULA, V. Design of
One-Cycle Decompression Hardware for Performance Increase
in Embedded Systems. In Proc. ACM/IEEE Design Automation
Conference, pages 34-39, June 2002.
[11] LEKATSAS, H.; WOLF, W. Code Compression for
Embedded Systems. In Proc. ACM/IEEE Design Automation
Conference, pages 516-521, June 1998.
[12] NETTO, E. B. W. Code Compression Based on Multi-
Profile. PhD thesis, IC, UNICAMP, Brazil, May 2004.
[13] NETTO, E. B. W.; AZEVEDO, R.; CENTODUCATTE, P.;
ARAÚJO, G. Mixed Static/Dynamic Profiling for Dictionary
Based Code Compression. The Proc. of the International
System-on-Chip Symposium, Finland, pages 159-163,
November 2003.
[14] NETTO, E. B. W.; AZEVEDO, R.; CENTODUCATTE, P.;
ARAUJO, G. Multi-Profile Based Code Compression. In Proc.
ACM/IEEE Design Automation Conference, pages 244-249,
June 2004.
[15] NETTO, E. B. W.; OLIVEIRA, R. S. de; AZEVEDO, R.;
CENTODUCATTE, P. Code Compression in Embedded
Systems. HOLOS CEFET-RN. Natal, Year 19, pages 23-28,
December, 2003. 94p.
[16] OLIVEIRA, A. S. de; ANDRADE, F. S. de. Embedded
Systems - Hardware and Firmware in Practice. São Paulo:
Publisher Érica, 2006, 316p.
[17] WOLFE, A.; CHANIN, A. Executing Compressed Programs
on an Embedded RISC Architecture. Proc. of Int. Symposium
on Microarchitecture, pages 81-91, December 1992.
[18] ALTERA Corporation. Available at: www.altera.com. Accessed on July 9, 2008.
[19] Assembly code compiled for the ARM9's MiBench CRC32.
Available at: www.efn.org/~rick/work/. Accessed February 17,
2009.
[20] Assembly code compiled for the ARM9's MiBench JPEG.
Available at: www.zophar.net/roms/files/gba/supersnake.zip.
Accessed February 17, 2009.
[21] Assembly code compiled for the ARM9's MiBench
QuickSort. Available at:
www.shruta.net/download/archives/project/report/5/5.2/ARM9.
Accessed February 17, 2009.
[22] Assembly code compiled for the ARM9's MiBench SHA1.
Available at: www.openssl.org/. Accessed February 17, 2009.
[23] DIAS, W. R. A. Architecture PDCCM in the Hardware for
Compression/Decompression of Instructions in Embarked
Systems. M.Sc. Dissertation, DCC, UFAM, Brazil, April 2009.
EMBEDDED SYSTEM THAT SIMULATES ECG WAVEFORMS
Thyago Maia Tavares de Farias
Programa de Pós-Graduação em Informática
Universidade Federal da Paraíba
Cidade Universitária - João Pessoa - PB
Brasil CEP: 58059-900
email: thyagomaia@hotmail.com
José Antônio Gomes de Lima
Programa de Pós-Graduação em Informática
Universidade Federal da Paraíba
Cidade Universitária - João Pessoa - PB
Brasil CEP: 58059-900
email: jose@di.ufpb.br

Fig. 1. Typical ECG signal.
ABSTRACT
This paper describes an embedded system developed for the simulation of electrocardiographic (ECG) signals. The objective of the system is to generate many examples of ECG waveforms for analysis and review in short time periods, eliminating the difficulties of obtaining real ECG signals through invasive and noninvasive methods. Any given ECG waveform can be simulated with this embedded system. The simulator was developed with Altera's Nios II development kits and Altera's CAD software for the definition of the hardware layer, and with the Fourier series and Karthik's algorithm implemented in the C language through Altera's Nios II IDE for the implementation of the software layer.
1. INTRODUCTION
According to Dirichlet [1], any periodic function that satisfies the Dirichlet conditions can be expressed as a series of scaled sine and cosine terms whose frequencies are multiples of a fundamental frequency. Karthik [2] also states that ECG signals are periodic, with frequency determined by the heart rate, and satisfy the Dirichlet conditions. Therefore, the Fourier series [3] can represent ECG signals. The Fourier series is described in (1a).

f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} a_n \cos\!\left(\frac{n\pi x}{l}\right) + \sum_{n=1}^{\infty} b_n \sin\!\left(\frac{n\pi x}{l}\right)    (1a)

a_0 = \frac{1}{l} \int_T f(x)\,dx, \qquad T = 2l    (1b)

a_n = \frac{1}{l} \int_T f(x) \cos\!\left(\frac{n\pi x}{l}\right) dx, \qquad n = 1, 2, 3, \ldots    (1c)

b_n = \frac{1}{l} \int_T f(x) \sin\!\left(\frac{n\pi x}{l}\right) dx, \qquad n = 1, 2, 3, \ldots    (1d)

This work is inspired by the algorithm implemented in the MATLAB script language by Karthik [2]. From the definition of signal parameters such as heart rate, amplitude and duration, the algorithm calculates separately the P, T, U, Q, S and QRS portions of a typical ECG signal. These portions are illustrated in Fig. 1. The calculation of each portion is based on the Fourier series described in (1a). Every significant feature of an ECG signal is generated from the sum of these waveforms.
Developing an embedded system based on Karthik's algorithm [2] enables the prototyping of hardware that will help researchers in the analysis and review of electrocardiographic signals. This prototype can be applied in corrective and preventive maintenance of various types of electrocardiography equipment, can periodically check the operating limits of heart monitors and similar equipment, and can evaluate and compare the performance of equipment from different manufacturers.
2. METHODOLOGY
The work involves the description of the Nios II platform, including internal peripherals and access to external devices, software development with GNU C/C++ in the Eclipse IDE and hardware-aided debugging. Nios II is treated as a reconfigurable soft-core processor. A set of standard peripherals accompanies the platform and personalized peripherals can also be developed. The development kit used to design the simulator was Altera's Nios Development Board, Stratix Edition. This
Fig. 2. Hardware layer.
procedure qrs_q_s(amp, dur, hbr)
{
    x = 0.01:0.01:600;
    li = 30/hbr;
    b = (2*li)/dur;
    wave_1 = (amp/(2*b))*(2-b);
    wave_2 = 0;
    n = <TOTAL_NUMBER_OF_SAMPLES>;
    for i = 1 to n
        harm = (((2*b*amp)/((i*i)*(PI*PI)))*(1 - cos((i*PI)/b)))*cos((i*PI*x)/li);
        wave_2 = wave_2 + harm;
    end
    final_wave = wave_1 + wave_2;
}

Fig. 4. Algorithm for the calculation of QRS, Q and S
portions.
Fig. 3. Software layer.
board provides a hardware platform for developing embedded systems based on Altera Stratix devices.
3. HARDWARE LAYER
Fig. 2 shows the hardware structure of the simulator. The Nios II core executes the module of the software layer, previously stored in SDRAM memory. Avalon is a bus that prioritizes data-communication speed, allowing parallel connections. The PIO module offers input and output paths, establishing communication between the Nios II platform and the blocks used. The flash memory device is an 8-Mbyte AMD AM29LV065D, used as general-purpose readable memory and non-volatile storage. The JTAG UART core uses the JTAG circuitry built into Altera FPGAs and provides host access via the JTAG pins of the FPGA.
The software used to define and generate the system was Quartus II together with SOPC Builder.
4. SOFTWARE LAYER
Fig. 3 shows the software structure of the simulator. The parameters of amplitude, duration, heart beat rate and intervals (P-R and S-T) are used to calculate the P, Q, QRS, S, T and U portions. Each portion is generated by one of two procedures: one responsible for calculating the samples of the QRS, Q and S portions, since these parts can be represented by triangular waveforms [2], and another responsible for calculating the samples of the P, T and U portions, since these parts can be represented by sinusoidal waveforms [2]. Fig. 4 shows the algorithm for the calculation of the QRS, Q and S portions (a C sketch of it is given below), and Fig. 5 shows the algorithm for the calculation of the P, T and U portions.
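As an illustration, the algorithm of Fig. 4 can be written in C roughly as follows, evaluated per sample; the function name, the number of harmonics and the use of M_PI are assumptions made here and are not necessarily those of the simulator's source code.

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N_HARMONICS 100   /* assumed number of Fourier terms */

/* One sample of the QRS/Q/S (triangular) portion, following Fig. 4 */
double qrs_q_s_sample(double amp, double dur, double hbr, double x)
{
    double li = 30.0 / hbr;                      /* derived from heart rate */
    double b  = (2.0 * li) / dur;
    double wave = (amp / (2.0 * b)) * (2.0 - b); /* constant (a0/2) term    */

    for (int i = 1; i <= N_HARMONICS; i++) {
        double harm = ((2.0 * b * amp) / (i * i * M_PI * M_PI))
                      * (1.0 - cos(i * M_PI / b))
                      * cos(i * M_PI * x / li);
        wave += harm;
    }
    return wave;
}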
The main procedure is responsible for passing the necessary parameters for the calculation of the wave portions to the auxiliary procedures and for merging the calculated portions into a single wave, the resulting ECG signal. Samples of the resulting signal are written to a text file, which can be opened in any software that generates graphics, such as MATLAB and Excel.
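A minimal sketch of that main flow, assuming a per-sample helper like the one above (plus an analogous p_t_u_sample for the sinusoidal portions) and an arbitrary output file name, could look like this:

#include <stdio.h>

double qrs_q_s_sample(double amp, double dur, double hbr, double x);
/* an analogous p_t_u_sample() for the P, T and U portions is assumed */

int main(void)
{
    FILE *out = fopen("ecg_samples.txt", "w");   /* assumed output file name */
    if (out == NULL)
        return 1;
    /* default parameters taken from Table 1: R amplitude 1.60 mV,
       QRS duration 0.11 s, heart beat rate 72                        */
    for (double x = 0.01; x <= 600.0; x += 0.01) {
        double ecg = qrs_q_s_sample(1.60, 0.11, 72.0, x);
        /* in the full simulator the other portions would be added here */
        fprintf(out, "%f %f\n", x, ecg);
    }
    fclose(out);
    return 0;
}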
procedure p_t_u(amp, dur, hbr, int)
{
    x = 0.01:0.01:600;
    x = x - int;
    li = 30/hbr;
    b = (2*li)/dur;
    u1 = 1/li;
    u2 = 0;
    n = <TOTAL_NUMBER_OF_SAMPLES>;
    for i = 1 to n
        harm = (((sin((PI/(2*b))*(b-(2*i))))/(b-(2*i)) + (sin((PI/(2*b))*(b+(2*i))))/(b+(2*i)))*(2/PI))*cos((i*PI*x)/li);
        u2 = u2 + harm;
    end
    wave_1 = u1 + u2;
    final_wave = wave_1 * amp;
}

Fig. 5. Algorithm for the calculation of P, T and U
portions.
Table 1. Input data used in tests.
Heart beat rate              72 bpm
Amplitude - P wave           0.25 mV
Amplitude - R wave           1.60 mV
Amplitude - Q wave           0.025 mV
Amplitude - T wave           0.35 mV
Duration - P-R interval      0.16 s
Duration - S-T interval      0.18 s
Duration - P interval        0.09 s
Duration - QRS interval      0.11 s
Fig. 6. Obtained ECG signal after tests.
The software used for the development of this layer was the Nios II Embedded Design Suite, and the implementation was written in the C language. This IDE loads the developed software into the SDRAM memory of Altera's development kit, where it is executed by the Nios II core.
5. RESULTS
Table 1 shows the input data used in the tests for the generation of an ECG signal by the developed embedded system. These values are used as default values by the simulator; other values can be specified for the generation of ECG signals with distinct features. Fig. 6 shows the resulting ECG signal.
6. CONCLUSION
The results obtained show that the developed embedded system succeeded in simulating an ECG signal through the Fourier series. This simulator can simulate any given ECG waveform without using an ECG machine, removing the difficulties of acquiring real ECG signals with invasive and noninvasive methods.
7. REFERENCES
[1] G. L. Dirichlet. (1829, Jan.). Sur la convergence des séries trigonométriques qui servent à représenter une fonction arbitraire entre des limites données. Journal für die reine und angewandte Mathematik. [Online]. 1829(4), pp. 157-169. Available: http://www.reference-global.com/doi/abs/10.1515/crll.1829.4.157
[2] R. Karthik. (2009, Aug. 13). ECG Simulation Using MATLAB - Principle of Fourier Series. [Online]. Available: http://www.mathworks.com/matlabcentral/fileexchange/10858
[3] J. Fourier. (1826). Théorie du mouvement de la chaleur dans les corps solides (suite). Mémoires de l'Académie royale des sciences de l'Institut de France. [Online]. pp. 153-246. Available: http://gallica.bnf.fr/ark:/12148/bpt6k33707/f6n94.capture

An FPGA BASED CONVERTER FROM FIXED POINT TO LOGARITHMIC NUMBER
SYSTEM FOR REAL TIME APPLICATIONS
Elio A. A. De María, Carlos E. Maidana,
Fernando I. Szklanny
Grupo de Investigación en Lógica Programable
Universidad Nacional de La Matanza
Florencio Varela 1903, San Justo
Prov. Buenos Aires - Argentina
email: gilp@unlam.edu.ar
ABSTRACT
This paper presents a high speed conversion system,
based on programmable logic arrays, to be used to convert
fixed point values, as those obtained at the output of a high
speed analog to digital converter, into the logarithmic
number system. It is a basic objective for this research to
obtain a real time conversion of the data outputs generated
by conversion devices in order to manage the obtained data
in a proper format which allows arithmetical operations in a
simple and efficient process. A conversion algorithm is
therefore suggested, which avoids the use of tables and interpolation methods. Another feature of the algorithm is that it is entirely implemented in a single FPGA, without the need for external hardware (such as external RAM memories) and with minimum use of internal resources.
1. OBJECTIVES
The need for real time or near real time numerical
operations, with a high level of accuracy, appears in
different areas of current technology. In the particular area
of analog to digital conversion, state of the art converters
are available working at sampling rates of around 1
Gsamples/sec.
In such cases, real time operations require the use of a representation system that allows these calculations to be made properly while offering good precision in the results. This requires a very good response time from the logic circuits involved in such calculations.
For these application areas, exponential formats offer important advantages over other representation systems, because a wide range of values can be represented with adequate precision, suitable for most real world applications.
The use of a floating point number system, like the traditional IEEE 754 standard [1], or of a logarithmic number system is therefore suitable for this objective.
This paper responds to the need to perform arithmetical operations in real time or near real time systems, in order






to use the provided results in digital signal processing
applications.
It is a first objective of our project, described in this paper, to develop an algorithm able to convert integer numbers, such as those obtained at the output of an analog to digital converter, to the Logarithmic Number System representation, using numerical procedures that do not require a large amount of hardware resources or time-consuming calculations.
It is another objective of this project, to show that the
conversion can be made with a minimum conversion error,
comparable or better than the AD conversion error, and
better than the approximation errors associated with the
logarithmic number system itself.
It is a third objective of this project to fit the complete fixed point to LNS data converter into one single field programmable gate array, using the least possible amount of hardware resources, especially sequential logic elements, and with no need for external devices.
2. INTRODUCTION AND BACKGROUND.
During the last years, many researchers have analyzed the
characteristics of LNS representations. Many published
papers are based on a comparison with a conventional
floating point representation system, referring basically to
the way arithmetical operations are solved, and to the
precision and errors related to each representation systems.
Among these papers, those written by Matousek et al [2]
Haselman et al [3], Detrey and de Dinechin [4] can be
mentioned.
Matousek et al [2] analyze the logarithmic number
system configuration, including ranges and precision of the
system, in order to conclude that this representation system
is adequate for being used in FPGA devices.
Haselman et al [3] compare the logarithmic number
system and the IEEE floating point representation system,
and suggest an interesting conversion method between both
systems. This paper includes a deep analysis of the
hardware requirements needed for this conversion.
On the other hand, Detrey and de Dinechin [4] propose a VHDL library of LNS operators to be used in signal processing applications.
These and other papers refer to different considerations
about the way of solving arithmetical operations, especially
as regards to adding and subtracting, when working in
logarithmic number systems.
In some of these papers, and in order to obtain
minimum errors, the abovementioned operations are based
on tables and interpolation systems.
When the bit length of the numbers to be converted rises, these tables can grow enough to require memory chips external to the FPGA used for the implementation of such arithmetical operations.
The LNS representation system shows as a major
advantage the fact that the relative error in a numerical
representation is constant and only depends on the number
of bits included in the fractional part of the exponent. Thus,
what has to be decided is how many bits are needed, in the
LNS exponent, to convert a fixed point number with a
reasonable error, compatible with the standard
representation error.
On the other hand, considering that the number to be converted is a fixed point integer value obtained as the output of an analog to digital converter, this number, acting as the input to the LNS converter, already carries an input error, namely the quantization error of the AD conversion, which is not higher than 0.5 LSB. This reference error can be taken into account to limit the LNS representation to a number of bits consistent with the required precision. This means that, even though the usual LNS format has an exponent with an eight-bit integer part and a 23-bit fractional part, such an amount of bits may not be needed given the fixed point input error.
3. THE CONVERSION ALGORITHM.
Let N be a number to be represented in LNS format. It will be expressed as written in equation (1), where m and f are the integer and fractional parts of an exponent of base 2:

N = 2^{m.f} = 2^{m + 0.f}    (1)

If a representation range similar to the one given by
IEEE 754 32 bits floating point standard is needed, it can be
shown that values for m and f are 8 and 23 bits,
respectively.
From equation (1), it can be stated that:

\log_2 N = m + 0.f    (2)

The value of the integer part of the exponent, m, is defined by the number of bits of the integer number to be converted. This number can be written as a normalized expression 1.mmm..., through equation (3):

N = 2^{m} \cdot 2^{0.f} = 1.mmm\ldots \times 2^{m}    (3)

In this equation, the exponent 0.f can be expressed as a sum of negative powers of base 2, as shown in equation (4):

0.f = \sum_{i=1}^{k} f_i \, 2^{-i}, \qquad f_i \in \{0, 1\}    (4)

The conversion algorithm developed in this paper is
based on obtaining the exponent fractional bits through
successive squaring of the normalized expression for
number N.
As the normalized number to be multiplied by itself ranges from 1 to almost 2, its square will take values from 1 to almost 4, which, expressed in binary, are represented as 01.xxxxxx...x up to 11.xxxxxx...x, where xxxxxx...x is the fractional part of the mantissa in both cases.
If the obtained square has a two-bit integer part, the value has to be normalized again. This requires adding a 1 in the corresponding position of the fractional part of the exponent and renormalizing the result to the 1.xxxxxx...x format, truncating its fractional part to the original quantity of bits.
If the result obtained by multiplying the normalized number by itself has an integer part equal to 1, the next bit in the fractional part of the exponent is a zero.
This procedure is repeated until the required precision for the representation of N is obtained; the number of successive squaring steps therefore depends on the requested precision. If N, in its normalized expression, has a fractional part with a predefined quantity of bits, its square has twice that quantity of bits. The developed algorithm therefore analyzes the need for truncation or rounding of the obtained square, in order to keep the number of bits constant over the iterations.
Truncation of the obtained result introduces an error, which must be measured to make sure that the accumulated error of the multiplication and normalization process does not exceed the error of the LNS representation system itself.
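A software model of the successive squaring scheme may help to visualize the iteration. The sketch below, in C, uses a 16-bit fractional mantissa and produces 18 fractional exponent bits; these word lengths and the function name are assumptions for illustration, not the exact parameters of the FPGA design.

#include <stdint.h>

#define MANT_FRAC 16     /* fractional bits of the numerical mantissa     */
#define FRAC_BITS 18     /* fractional bits produced for the LNS exponent */

/* Converts a non-zero fixed point value n into its LNS exponent: integer
   part *m (position of the first significant one) and fraction *f,
   obtained by successive squaring of the normalized mantissa.            */
void fix_to_lns(uint32_t n, unsigned *m, uint32_t *f)
{
    unsigned msb = 31;
    while (((n >> msb) & 1u) == 0)      /* software "priority encoder" */
        msb--;
    *m = msb;

    /* normalized mantissa 1.xxxx...x: value = mant / 2^MANT_FRAC, in [1, 2) */
    uint64_t mant = ((uint64_t)n << MANT_FRAC) >> msb;

    *f = 0;
    for (int i = 0; i < FRAC_BITS; i++) {
        uint64_t sq = mant * mant;      /* square lies in [1, 4) */
        if (sq >= ((uint64_t)1 << (2 * MANT_FRAC + 1))) {
            *f   = (*f << 1) | 1u;                /* integer part has 2 bits */
            mant = sq >> (MANT_FRAC + 1);         /* renormalize to 1.x      */
        } else {
            *f   = (*f << 1);                     /* integer part is 1       */
            mant = sq >> MANT_FRAC;               /* truncate to width       */
        }
    }
}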
4. ALGORITHM IMPLEMENTATION
The proposed conversion algorithm was implemented using
a Virtex II Pro device, considering as a main objective to
obtain a design with minimum space requirements and
maximum speed. This led to a pipelined structure, which
allows, once that all the stages of the pipeline are full, to
obtain a new conversion with every clock pulse in the
device.
Figure 2 depicts the first stage of the pipeline, in charge of the normalization of the incoming number to be converted. This stage includes a priority encoder, which obtains the integer part of the exponent of number N by finding the position of the first significant one in the incoming number. It also includes a barrel shifter, used to normalize the incoming number, where normalization means obtaining a number within the range 1.mmmm...m. For simplicity, this normalized number, which is not represented in LNS, will hereinafter be called the numerical mantissa of number N.
The barrel shifter will simply shift the number from
right to left as many times as indicated by the priority
encoder.
Furthermore, as LNS has no representation for the number 0, a zero flag is included, which is activated directly from the incoming number.
Results coming out of this stage are latched into the block that calculates the fractional part of the exponent.
Stage 1 of this block, as shown in Fig. 2, multiplies this mantissa by itself, and the obtained result is truncated to 17 bits. As shown before, after the second power of the incoming normalized mantissa is calculated, the most significant bit of the result is included as a new bit in the fractional part of the LNS exponent.
Each of the successive stages of this pipeline will add
one new bit to the EXPONENT register. The number of
stages will equal the number of bits in the fractional part of
the exponent. In our case, the final register will include 4
bits for the integer part of the LNS exponent and 18 bits for
its fractional part.
The included multiplexer has the function of shifting the
normalized mantissa depending on the value of product bit
17.
Figure 1, for simplicity reasons, only shows the first
stage of the pipeline.
After the last stage of the pipeline, numerical mantissa
is disregarded, as only values included in the ZERO and
EXPONENT registers are required as result.
With the Virtex II Pro device used, the results obtained can be considered very good, as the final design uses only 1% of the chip resources. Regarding speed, the conversion rate is very high, since the pipelined architecture allows one conversion per clock pulse.
Fig. 2. Converter block diagram (one stage).

In the near future, it is planned to port this same design to a Spartan III family device, which is much cheaper than the Virtex II Pro we considered. The Spartan family has a limited quantity of multipliers, which can limit the results of the proposed algorithm.
Anyway, considering this situation, and considering also that the successive conversion stages produce fractional exponent bits of less and less weight in the final result, a future objective is to design this conversion system based on variable length multipliers, in order to reduce the need for large multiplying units.
Preliminary tests, which are not yet ready to be included in this paper, have shown a space saving of around 30% in the multiplying pipeline stages alone.
5. ACKNOWLEDGEMENT
The authors wish to thank the members of the Signal and Image Processing Research Group of our University, Mr. Roberto De Paoli and Mr. Luis Fernández, for all the knowledge and support received during our research work. Many of the results obtained in this project would have been impossible to reach without their help.
6. CONCLUSION AND RESULTS
This paper describes a proposed algorithm for converting fixed point numbers into the logarithmic number system (LNS) and the way the developed algorithm was implemented on a Virtex II Pro FPGA.
The developed project uses an iterative algorithm as the base of the conversion process, allowing a simple solution based on a pipelined multiplication system, with no requirements for external memories, huge tables or interpolation methods, usual in other conversion methods.
As can be seen in simulation results, shown in table 1,
the algorithm uses a very small part of the hardware
resources included in the FPGA.
The pipelined design will allow for one new conversion
completed at every clock cycle of the circuit.
This performance allows considering the use of the
developed system for high speed applications, such as
digital signal processing, audio and video applications,
among others.
Data inputs have been considered to be no larger than 16 bits, for compatibility with current AD converters that could be used together with this application. The use of the FPGA internal multipliers, which are 18-bit x 18-bit multipliers, allows very accurate results.
In fact, the testing process included converting every fixed point number in the 16-bit range into LNS, using the 18-bit multipliers in two different ways. In the first test, the internal multipliers of the FPGA were used as 16-bit x 16-bit multipliers. In the second test, they were used over their complete 18-bit range, even though the input data was represented in 16 bits.
Results obtained from the fixed point to LNS converter,
were again converted into fixed point numbers, for the
entire 16 bit number range. Results are shown in table 1.
This table shows that the absolute error of the conversion is never higher than 1.355, obtained when converting the integer number 55185, which implies a relative error of around 24.5 parts per million at that point. When the 18-bit multipliers are used, the maximum absolute error falls to 0.428857, obtained when converting the integer number 55516; in this case the relative error is less than 8 ppm.
Also considering the fact, already mentioned, that the
converter system allows for one conversion for each clock
cycle, after an initial delay of 18 clocks, it is clearly shown
that the developed converter is a high speed conversion
system, suitable for real time applications, and with no need
for RAM blocks or huge tables to be memorized, as in other
converters.
This will allow the converter to be developed on one
FPGA device, using only the hardware resources included
in the device.
The lack of need for external elements gives, as a result,
a compact design, available on well known commercial
FPGA devices.
Table 2 shows the hardware resources used in the Virtex II Pro FPGA. These results show that most of the hardware resources included in the FPGA have not been used and are therefore still available.
Table 1. Conversion error results.
Simulation results           Using 16-bit      Using 18-bit
                             multipliers       multipliers
Maximum absolute error       1.355             0.428857
  Obtained at number         55185             55516
Maximum relative error       28.5 ppm          8.77 ppm
  Obtained at number         8237              46425

Table 2. Device utilization summary.
Logic utilization    Used    Available    Utilization
Slices               223     13696        1%
Multipliers          18      136          13%
Multiplexers         1       16           6%
4-input LUTs         354     27392        1%
Slice flip-flops     329     27392        1%

In this connection, a further stage of this research will port the converter to a Spartan III device. Even considering that the Spartan III does not include as many multipliers, the results can still be useful for simpler applications that do not require very high speed and precision.
7. REFERENCES
[1] IEEE 754-2008 IEEE Standard for floating point arithmetic.
IEEE Computer Society. Jun. 2008
[2] R. Matousek, M. Tichy, Z. Pohl, J. Kadlec, C. Softley and N.
Coleman, Logarithmic Number System and Floating Point
Arithmetic in FPGA, Lecture Notes in Computer Science,
Springer Berlin, ISSN 0302-9743, Vol 2438/2002.
[3] Michael Haselman, Michael Beauchamp, Aaron Wood,
Scott Hauck, Keith Underwood, K. Scott Hemmert, "A
Comparison of Floating Point and Logarithmic Number
Systems for FPGAs,", 13th Annual IEEE Symposium on
Field-Programmable Custom Computing Machines
(FCCM'05), pp.181-190, 2005.
[4] Detrey, J.; de Dinechin, F., "A VHDL library of LNS
operators," Signals, Systems and Computers, 2003.
Conference Record of the Thirty-Seventh Asilomar
Conference on , vol.2, no., pp. 2227-2231 Vol.2, 9-12 Nov.
2003
[5] Virtex II Platform FPGA Handbook. Xilinx Inc., 2000.


HARDWARE CO-PROCESSING UNIT FOR REAL-TIME SCHEDULING ANALYSIS
José Urriza, Ricardo Cayssials (1), Edgardo Ferro
Universidad Nacional del Sur - CONICET
(1) Department of Electrical Engineering
Bahía Blanca - Argentina
email: iecayss@criba.edu.ar
ABSTRACT
In this paper we describe the design and the
implementation of a co-processing unit for real-time
scheduling analysis. This unit implements an arithmetic
architecture that determines the schedulability of a fixed-
priority discipline without requiring processing time from
the system processor.
Fixed-priority discipline is one of the most important
disciplines in real-time. In this discipline, a priority is
assigned to each task and it remains fixed during runtime.
Exact schedulability conditions are useful to determine if
the real-time requirements can be met. However, when the
schedulability is required to be determined during runtime,
the complexity of the calculus requires so much processing
time that makes the system unfeasible.
The processing unit efficiently implements the real-time analysis of a set of real-time tasks scheduled under a fixed-priority discipline and can be used in different real-time areas.
1. INTRODUCTION
In the classical definition ([1]), Real-Time Systems
(RTS) are those in which results must be not only correct
from an arithmetic-logical point of view but also produced
before a certain instant, called deadline.
Hard real-time systems are those in which no deadline
can be missed. In the hard real-time systems, missing the
deadline of a task may have severe consequences and
catastrophic results. Schedulability analyses were proposed
to determine that all deadlines will be met during runtime.
If the schedulability analysis is successful, it is said that the
system is schedulable, otherwise the system is non-


This work was supported by the Technological Modernization Program under Grant BID1728/OC-AR-PICT2005 Number 38162 and by the project "Digital processing platform for Active Power Line Filters" granted by Fundación Hermanos Agustín y Enrique Rocca.


schedulable and consequently some deadlines may be
missed.
In [1], the schedulability of single processor and
multitasking systems is considered. A priority discipline
establishes a linear order on the set of tasks, allowing the
scheduler to define at each instant of activation, which task
will use the shared processor.
Usually, tasks are considered to be periodic, independent and preemptible. A periodic task is one that requests execution again after a fixed interval of time. A task is independent if it does not need the result of the execution of any other task for its own execution. Finally, a task is preemptible when the scheduler can suspend its execution and withdraw it from the processor at any time.
Generally, the parameters of each task, under this framework, are: its execution time, denoted C_i; its period, denoted T_i; and its deadline, D_i. A real-time system is thus specified by a set of n tasks, S(n), such that S(n) = {(C_1, T_1, D_1), (C_2, T_2, D_2), ..., (C_n, T_n, D_n)}.
Numerous schedulability tests have been proposed in the real-time literature ([4, 5, 6, 7]). In 1986, Joseph and Pandya ([8]) presented an iterative fixed-point method to evaluate a necessary and sufficient condition for the feasibility of an RTS using a fixed-priority scheduler. Several works with equivalent solutions have been published afterwards. In 1998, Sjödin ([3]) improved Joseph's test by beginning the iteration of task i+1 at the worst case response time of task i found by Joseph's method plus the execution time of task i+1. In 2004, Bini ([2]) proposed a new method called the Hyperplanes Exact Test (HET) to determine the schedulability of an RTS.
All these proposals try to improve the efficiency of the schedulability analysis, since it is the base of several real-time areas: aperiodic task servers, fault tolerant computing, slack stealing techniques, the multitask-multiprocessor assignment problem, among others. Several of the proposed mechanisms are unfeasible during runtime because of the processing time that the schedulability analysis demands.
In this paper we propose a co-processing unit for the real-time scheduling analysis of real-time systems under a
fixed-priority discipline. This co-processor unit can be
included in different architectures since it was designed
with a general memory interface. The high performance of
the unit makes it useful to implement on-line real-time
strategies.
This paper is organized as follows: Section 2 describes the main concepts of real-time scheduling analysis. Section 3 explains the arithmetic architecture proposed to solve the fixed point schedulability function. In Section 4, we describe the data structure used to interface the arithmetic structure with the processor of the system, using the main memory as the interface. Section 5 shows the results obtained. In Section 6, we describe the target applications in which this architecture may be applied. Conclusions are drawn in Section 7.
2. REAL-TIME SCHEDULING ANALYSIS
Real-time scheduling analyses are proposed in order to
determine the schedulability of real-time systems. The
scheduling analysis depends on the priority discipline
considered. Two of the most important priority disciplines
in real-time are: Earliest Deadline First and Fixed Priority.
Almost all practical real-time operating systems implement
fixed priority schedulers and consequently it is important
to get efficient scheduling analysis strategies for this
discipline.
Since 1986, most of the schedulability tests developed
for fixed-priority disciplines are based on applying fixed
point methods to guarantee the schedulability of the real-
time system.
By definition, a fixed point of a function f is a number t,
such that t= f(t). In our case, the fixed point function is a
function of time and consequently the t point and the
instant t are equivalent expressions.
The first fixed point method to determine the schedulability of a real-time system under a fixed priority discipline was developed by Joseph and Pandya ([8]). In [8] it is shown that there is no closed analytical solution for this kind of problem, which can only be solved by iterative calculations.
Joseph's method is initialized at the critical instant, in which all the tasks are invoked simultaneously. As shown in [8], the result is the Worst Case Response Time of task i, denoted W_i, within the subset of tasks S(i). The fixed point equation proposed in [8] is:
t^{q+1} = C_i + \sum_{j=1}^{i-1} \left\lceil \frac{t^{q}}{T_j} \right\rceil C_j    (1)

The fixed point given by this equation, if it exists, is the Worst Case Response Time of task i. Consequently, task i is schedulable if, starting at t^0 = C_i + W_{i-1}, there exists a fixed point (t^{q+1} = t^q) and the Worst Case Response Time of task i is lower than or equal to its deadline (t^q <= D_i).
Otherwise, the task i is non schedulable and consequently
the real-time system is non schedulable as well.
The utilization factor of a real-time task is defined as C_i / T_i. The total utilization factor of the real-time system is the sum of the utilization factors of its tasks, U = \sum_{i=1}^{n} C_i / T_i. A necessary condition for schedulability is that the total utilization factor be less than or equal to 1.
Several methods have been proposed to solve this fixed point function. All of them start with an initial value of t and iterate until reaching the fixed point. The complexity of these methods for a real-time system with n tasks is proportional to n^2 \cdot \max(T). This complexity may make the scheduling analysis unfeasible during runtime because of the processing time required to find the fixed point.
In this paper we propose a hardware processing unit
that solves this fixed point function without requiring
processing time from the processor of the system and
consequently without perturbing the execution of the real-
time tasks.
3. ARITHMETIC ARCHITECTURE
The proposed arithmetic architecture resolves the schedulability condition given in Eq. 1. A fixed point function requires an iterative method to find a solution, and several methods have been proposed to reach the final solution in the fewest iterations. Trying every value of t from 0 to T_i may lead to a great number of iterations and consequently to a very time consuming method.
The simplest and most trivial iteration method begins with t^0 = 0, evaluates the function to get the next value of t and ends when the value obtained is equal to the value used in the evaluation. This iteration mechanism is valid because the fixed point function is proven to be monotonic. Figure 1 shows the arithmetic architecture proposed to perform the calculation of Eq. 1.
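For reference, the trivial iteration just described can be sketched in C as follows. This is only a software model of Eq. 1 under the assumption that the task parameters are available in plain arrays indexed by decreasing priority; it is not the hardware unit itself, which reads the parameters from the system memory.

#include <stdint.h>

/* Software model of the iteration of Eq. 1: tasks 0..i-1 have higher
   priority than task i. Returns 1 if task i is schedulable and stores
   its worst case response time in *w, 0 otherwise.                     */
int task_schedulable(const uint32_t *C, const uint32_t *T, const uint32_t *D,
                     int i, uint32_t *w)
{
    uint32_t t = 0;
    for (;;) {
        uint32_t next = C[i];
        for (int j = 0; j < i; j++)
            next += ((t + T[j] - 1) / T[j]) * C[j];   /* ceil(t/T[j]) * C[j] */
        if (next == t) {                /* fixed point reached: t = f(t)     */
            *w = t;
            return t <= D[i];
        }
        if (next > D[i])                /* response already exceeds deadline */
            return 0;
        t = next;
    }
}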
All the architecture requires is synchronization to produce the values C_j and T_j each time they are needed. This synchronization is in charge of a sequential machine that fetches these values from the memory of the system and transfers them to the respective registers. The sequential machine initializes the accumulator to zero and determines the end of the iterations.
From Fig. 1, it can be noted that the time required for the calculation depends on the time required first for the integer division and then for the integer multiplication. The throughput can easily be increased with a pipeline that contains the integer divider in the first stage and the integer multiplier in the second stage.
4. DATA STRUCTURE
There is a great deal of information involved in the schedulability analysis. This information contains all the parameters of the real-time tasks and may be modified or accessed by the processor of the system. The arithmetic architecture must access the real-time parameters efficiently and without perturbing the execution of the real-time tasks of the system. For this reason we implemented a memory accessing unit with indirection capabilities to access the different parameters of the real-time tasks.
Fig. 1. Arithmetic architecture proposed (integer divider, integer multiplier, comparator and accumulator computing t^{q+1} from t^q, C_i, C_j and T_j).

The information was structured so as to support all the different target applications that may require a schedulability analysis. The data structure implements indexes to access the real-time parameters of the tasks as well as to store the results of the analysis. The data structure of the real-time system (Table 1) stores the number of tasks, the index of the data structure of the highest priority task and the results of the scheduling analysis. There is a memory address used to command the beginning of the scheduling analysis and another memory address in which the scheduling unit indicates that the analysis has ended.
The data structure of each real-time task (Table 2) stores the real-time parameters, the index of the next highest priority task and the result of the scheduling analysis for that task.

Table 1. Data structure of the real-time system.
n: number of tasks
Index to the highest priority task
System schedulability: true or false
Start analysis: true or false
End of analysis: true or false

Table 2. Data structure of each real-time task.
C: worst case execution time
T: period
D: deadline
W: worst case response time
Schedulability: true or false
Index to the next highest priority task

This data structure allows easy communication with the different target applications that require an efficient on-line real-time scheduling analysis. Changes to the real-time parameters of the tasks, to the number of real-time tasks of the system and to the priorities of the tasks can easily be made during runtime without any change to the proposed arithmetic architecture. Moreover, this data structure can be shared between the arithmetic architecture and the real-time processor, or with other specific hardware units.
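One possible C view of these memory-resident structures, as they could be shared between the processor and the scheduling unit, is sketched below; the field names and 32-bit types are assumptions made for illustration.

#include <stdint.h>

/* Data structure of each real-time task (Table 2), assumed layout */
typedef struct {
    uint32_t C;            /* worst case execution time               */
    uint32_t T;            /* period                                  */
    uint32_t D;            /* deadline                                */
    uint32_t W;            /* worst case response time (result)       */
    uint32_t schedulable;  /* true or false (result)                  */
    uint32_t next_index;   /* index of the next highest priority task */
} rt_task_t;

/* Data structure of the real-time system (Table 1), assumed layout */
typedef struct {
    uint32_t n;                   /* number of tasks                    */
    uint32_t highest_index;       /* index of the highest priority task */
    uint32_t system_schedulable;  /* true or false (result)             */
    uint32_t start_analysis;      /* written by the processor           */
    uint32_t end_of_analysis;     /* written by the scheduling unit     */
} rt_system_t;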
5. EXPERIMENTAL RESULTS
The unit was synthesized for an APEX device. It
required 276 LC for a 16 bit implementation with the
divider and the multiplier parameterised for 16 clock
cycles.
The proposed architecture was tested using several randomly generated real-time systems. From the experiments it could be noted that the number of iterations, and consequently the time required to find the fixed point, depends on the magnitude of the different parameters of the real-time tasks. However, the number of iterations always remains below the theoretical complexity of schedulability equation 1 (n^2 \cdot \max(T)).
Of course, this complexity analysis was already performed by the authors of the iteration method implemented in the proposed architecture. However, it has to be considered that the time required to find a solution using the proposed arithmetic architecture is measured in clock periods, whilst an algorithm implemented on a processor is measured in numbers of arithmetic instructions, each of which requires several periods of the system clock. Moreover, the scheduling analysis may be improved by choosing faster divider and multiplier units. The difference in time required between both measurements is several orders of magnitude, which can be the difference between
the feasibility of implementing the schedulability analysis
on-line or not.
6. TARGET APPLICATIONS
Several applications in real-time systems rely on an efficient implementation of a scheduling analysis technique. The scheduling analysis is used to decide what actions should be taken in the future; however, if the time required to produce a result is long enough to turn that future into the past, then the analysis makes no sense. The lack of an efficient on-line scheduling analysis is the main reason why most of these real-time mechanisms cannot be implemented during runtime. Some of the target applications for this architecture are:
Dual scheduling: the execution time of a processor is shared among two or more schedulers running real-time and non-real-time tasks. When the deadline of a real-time task cannot be met, the task has to be assigned to another scheduler or the processor time assigned to its scheduler has to be increased.
Slack stealing methods: use the idle processing time left by the real-time tasks to execute non-real-time tasks. In order to improve the response time of the non-real-time tasks, it is worth postponing the execution of the real-time tasks as much as possible.
Adaptive scheduling: is used when the period of a real-time task can be modified in order to change the total utilization factor of the system. In this way, the utilization factor of each real-time task may be adapted to the current load of the system in order to make it schedulable.
Dynamic task assignment: allows assigning real-time tasks during runtime. The real-time system has to guarantee that the schedulability will not be affected.
Fault tolerance mechanisms: based on the execution of different task versions that produce the same results in order to compare them. The scheduler has to guarantee that there will be enough time to execute all the versions of the real-time tasks without jeopardising the schedulability of the real-time system.
Flexible real-time systems: are those in which the missed deadlines are bounded and restricted to a certain pattern. The schedulability analysis has to be done in order to guarantee that the temporal constraints of the real-time tasks will be satisfied.
Dynamic voltage scheduling: is a strategy that modifies the voltage/frequency of the processor in order to reduce power consumption. The techniques applied have to guarantee that the temporal constraints will be satisfied with the minimum possible power consumption.
These are some of the applications that become feasible when an efficient scheduling analysis is available. The proposed architecture improves by several orders of magnitude the performance of a scheduling analysis implemented as a software algorithm executed by the same processor that also executes the real-time tasks.
7. CONCLUSIONS
In this paper we presented a hardware architecture to solve a fixed point function. This fixed point function is an exact schedulability condition that guarantees the schedulability of a real-time system.
Several applications were detailed, most of which were not suitable for runtime implementation because of the complexity of the method needed to solve the fixed-point schedulability function. This hardware architecture is intended to be used in real-time systems during runtime.
We also proposed an adequate data structure to share the real-time information with the system processor. This memory-based interface makes the scheduling analysis architecture adaptable to different processors without perturbing the execution of the real-time tasks.
8. REFERENCES
[1] J. A. Stankovic, "Misconceptions About Real-Time Computing: A Serious Problem for Next-Generation Systems," IEEE Computer, pp. 10-19, October 1988.
[2] E. Bini and G. C. Buttazzo, "Schedulability Analysis of
Periodic Fixed Priority Systems," IEEE Trans. on
Computers, vol. 53, pp. 1462-1473, November 2004.
[3] M. Sjödin and H. Hansson, "Improved Response-Time
Analysis Calculations," in IEEE 19th Real-Time Systems
Symp., 1998, pp. 399-409.
[4] R. Wilhelm, J. Engblom, A. Ermedahl, N. Holsti, S. Thesing,
D. Whalley, G. Bernat, C. Ferdinand, R. Heckmann, Tulika
Mitra, F. Mueller, I. Puaut, P. Puschner, J. Staschulat, and P.
Stenstrm, "The Worst-Case Execution Time Problem
Overview of Methods and Survey of Tools," Mlardalen
University March 2007.
[5] A. Burchard, J. Liebeherr, Y. Oh, and S. H. Son, "New
Strategies for Assigning Real-Time Tasks to Multiprocessor
Systems," IEEE Trans. on Computers, vol. 44, pp. 1429-
1442, 1995.
[6] C. C. Han, "A Better Polynomial-Time Schedulability Test
for Real-Time Multiframe Tasks," in IEEE 19th Real-Time
Systems Symposium, 1998, pp. 104-113.
[7] T.-W. Kuo, Y.-H. Liu, and K.-J. Lin, "Efficient Online
Schedulability Tests for Real-Time Systems," IEEE
Transactions on Software Engineering, vol. 29, pp. 734-751,
August 2003.
[8] M. Joseph and P. Pandya, "Finding Response Times in a
Real-Time System," The Computer Journal (British Computer
Society), vol. 29, pp. 390-395, 1986.
HARDWARE IMPLEMENTATION OF THE MINKOWSKI METHOD FOR THE
COMPUTATION OF THE FRACTAL DIMENSION
Maximiliam Luppe

Departamento de Engenharia Elétrica / Escola de Engenharia de São Carlos
Universidade de São Paulo
Av. Trabalhador são-carlense, 400 - São Carlos - SP - Brazil - 13566-590
email: maxluppe@sc.usp.br
ABSTRACT
The fractal dimension is an extremely important tool for the characterization and analysis of shapes, with applications ranging from signal processing to optics. One of the reasons for such great interest is its ability to correctly express the complexity and self-similarity of signals. In addition, the fractal dimension can also be understood as an indication of the spatial coverage of a given shape. Several numerical approaches have been developed to compute the fractal dimension, such as the widely used Box Counting method. However, when applied to real data it does not give results as good as those of the Minkowski method. The Minkowski method for computing the fractal dimension involves a series of dilations of the original shape with respect to several radii, which define the spatial scale. The area of the successive dilations is plotted against the radii in a log-log graph, and the fractal dimension is taken as 2 - (slope of the fitted line). The dilations, normally obtained through morphological operations, can also be obtained by means of the Euclidean Distance Transform (EDT). The EDT computes the minimum distance between a background pixel and the shape. The dilations are obtained by thresholding the image generated by the EDT at every distance that can be represented in the image. This paper presents a proposal for a dedicated hardware implementation, based on the EDT, of the Minkowski method for the computation of the fractal dimension, suitable for implementation in reconfigurable devices.
1. INTRODUCTION
The main characteristic of a fractal [1] is related to its dimension, with which it is possible to determine the degree of complexity of a line or the roughness of a surface; or, according to Russ [2], the fractal dimension is the rate at which the perimeter (or the area of a surface) of an object increases as the measurement scale is reduced. The applications of the fractal dimension are related to shape analysis, such as the analysis of neuron shapes [3], the study of infiltration in soils [4] and roughness analysis [5], among others.
According to Allen et al. [6], the analysis strategies for measuring the fractal dimension can be divided into two groups: vector-based methods and matrix-based methods. Among the vector-based methods there is the Structured Walking method. Among the matrix-based methods there are the Box Counting method and the Distance Map method, also known as the Minkowski sausage method. Besides these, other methods for computing the fractal dimension have been proposed [7] and evaluated [3], [4], [8]. Although the Minkowski sausage method is considered the most accurate and the least sensitive to noise and rotation, according to Bérubé and Jébrak [8] and Allen et al. [6], it has been little used. Preference has been given to the Box Counting method, mainly because of its ease of implementation.
One way of implementing the Minkowski sausage method is through the Distance Transform [9], in particular the Euclidean Distance Transform (EDT). In this work we use an architecture for computing the EDT in real time, proposed in [10], to compute the fractal dimension by the Minkowski sausage method.
Section 2 gives a brief description of the methodology for computing the fractal dimension based on the Minkowski sausage method using a Distance Map. Section 3 details the implementation of the architecture and Section 4 presents the conclusions.
2. FRACTAL DIMENSION COMPUTATION
The Distance Map method for computing the fractal dimension is based on a process known as the Minkowski sausage. In this process, each point belonging to the contour of the object is dilated, or covered, by circles of radius ε (figure 1), forming strips, or bands, better known as sausages, whose area A(ε) is proportional to ε^(2-D) [8], where D is the fractal dimension. The fractal dimension obtained by this method is better known as the Minkowski-Bouligand dimension.
Supported by FAPESP (2007/04657-3).

Figure 1 - Example of the Minkowski method
By thresholding a Distance Map (created from the EDT applied to the contour points of the object) at different grey levels, or values of ε, we create bands similar to those obtained by the Minkowski sausage process. We can thus obtain the fractal dimension from a graph of the logarithm of the area A(ε) against the logarithm of the radius ε, which yields a straight line whose slope is equal to 2 - D.
In this way, the fractal dimension can be obtained by computing the slope of the line in a graph of log A(ε) versus log ε. One way to obtain the area as a function of the radius is by means of a cumulative histogram of the image. This histogram gives, for each grey level, or value of ε, the number of pixels with values less than or equal to it. The module for computing the fractal dimension therefore requires a structure to obtain the histogram of the distance map, a structure to compute logarithms and another to compute the slope. The area-versus-radius graph is obtained from the cumulative histogram, where for each radius value ε all the points of the distance map with radius less than or equal to ε are considered.
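For illustration, the procedure just described can be summarized by the following minimal C sketch, assuming the Euclidean distance map is already available as an array of integer distances; the function name, the MAX_DIST bound and the use of natural logarithms are assumptions of this example only, not part of the proposed hardware.

#include <math.h>
#include <string.h>

#define MAX_DIST 512                 /* assumed upper bound on map distances */

/* Fractal dimension from a distance map: a cumulative histogram gives A(eps),
 * a least-squares fit of log A(eps) vs log eps gives the slope, D = 2 - slope. */
double fractal_dimension(const unsigned *dist_map, int n_pixels)
{
    unsigned hist[MAX_DIST];
    double sx = 0, sy = 0, sxy = 0, sxx = 0;
    int n = 0;

    memset(hist, 0, sizeof hist);
    for (int i = 0; i < n_pixels; i++)        /* histogram of the distance map */
        hist[dist_map[i]]++;

    unsigned long area = 0;                   /* cumulative histogram = A(eps) */
    for (int eps = 1; eps < MAX_DIST; eps++) {
        area += hist[eps];
        if (area == 0)
            continue;
        double x = log((double)eps);          /* log of the radius             */
        double y = log((double)area);         /* log of the area               */
        sx += x; sy += y; sxy += x * y; sxx += x * x;
        n++;
    }
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    return 2.0 - slope;
}

The hardware described in section 3 computes the same quantities with a memory-based histogram, a fixed-point logarithm and an integer least-squares slope.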
3. IMPLEMENTATION
The histogram of an image is obtained by counting, for each distance value present in the distance map, the number of pixels that have that distance (or radius) value. There are two ways to implement this structure: with individual counters (one for each distance value) or with memory. The use of registers, although simpler, implies a large consumption of logic cells, whereas the memory-based implementation uses the memory blocks already available in the FPGA and only a few logic cells. For this reason, the memory-based implementation was chosen.
To implement the pixel count using memory, a read-modify-write scheme is used. In this scheme, a memory position is addressed using the grey-level value of the pixel. The value stored at this position is read, incremented and stored back at the same position. The best way to implement this scheme is to use a dual-port synchronous memory together with a technique known as Clock-2x. In this technique, the clock signal used to access the memory is doubled, while the original clock signal is used to enable writing to the memory. The data available at one of the ports is incremented by an adder and the result of the addition is fed to the other port. Figure 2 shows the implementation of this technique.

Figure 2 - Histogram module
During the histogram generation process, the pixel value is used as the address of the data to be read, incremented and stored. To dump the histogram, an external counter is used to generate the addresses for reading the data and for the subsequent clearing (storing the value 0) of the memory. Multiplexers controlled by the sel signal are used for this operation.
To obtain the cumulative histogram, it is enough to dump the contents of the memory into an accumulator circuit. This accumulator stores, and passes on, the result of the sum of the previous value with the new data. Each value of the cumulative histogram represents the area A(ε) covered by the pixels for each distance ε representable in the image by the distance map.
Once the cumulative histogram has been computed, the logarithm of its data must be calculated in order to obtain the data needed to generate the log A(ε) vs log ε graph and, from it, the slope of the line that best fits the data. To obtain the logarithm of the area, the data, which are in integer format, must be converted to floating-point format. This conversion is performed by the ALTFP_CONVERT megafunction of the Quartus II tool, which can be used both to convert integer data (32 or 64 bits) to floating point (single or double precision) and vice versa. For this purpose, the components i2048log_i2f and i2048log_f2i were created for the integer-to-floating-point and floating-point-to-integer conversions, respectively.
After the integer-to-floating-point conversion, the logarithm is computed (using the i2048log_log component, created from the ALTFP_LOG megafunction) and the result is multiplied by 2048 (using the i2048log_mul component, created from the ALTFP_MULT megafunction) before being converted back to integer. The multiplication by 2048 makes it possible to keep working with integers instead of floating point, which reduces the logic consumption. Figure 3 shows the convert-logarithm-multiply circuit.

Figure 3 - Logarithm module
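The multiplication by 2048 amounts to keeping the logarithm with 11 fractional bits, so that all subsequent arithmetic can stay in integers. The combined effect of the convert-logarithm-multiply chain can be modelled in C as follows (the ALTFP megafunctions themselves are Quartus II library blocks; this function is only a software illustration):

#include <math.h>
#include <stdint.h>

/* int -> float -> log -> x2048 -> int: the logarithm is kept with 11
 * fractional bits (2048 = 2^11) for later integer-only processing.
 * v > 0 is assumed. */
static int32_t log_q11(uint32_t v)
{
    return (int32_t)lrint(log((double)v) * 2048.0);
}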

In the same way, the logarithm of the distance values ε was also computed, multiplied by 2048 and stored in a ROM (using the dimfrac_rom component, created from the ROM: 1-PORT megafunction), to be used together with the log A(ε) values in the computation of the slope. Thus, after the histogram has been generated and stored in a RAM, the fractal dimension can be computed by sending the area and distance data to a slope computation module. Both the RAM (with the histogram data) and the ROM (with the distance data) are accessed at the same time by means of a counter that dumps both memories. To synchronize the signals Yi (log A(ε)) and Xi (log ε), since the path travelled by the data coming from the histogram is longer, owing to the processing performed on them, it was also necessary to delay the Xi signal by means of the dimfrac_shr component, created from the Shift register (RAM-based) megafunction.
For the slope computation, the Least Squares Method (MMQ) is used. Considering a set of ordered pairs (x_i, y_i) that describe a line y = a + bx, the coefficients a and b can be found by means of the following operations:

(1a)  b = (n·Σx_i·y_i - Σx_i·Σy_i) / (n·Σx_i² - (Σx_i)²)
(1b)  a = (Σy_i - b·Σx_i) / n

Since we are only interested in the slope b, given that D = 2 - (slope of the log A(ε) vs log ε graph), the intercept a was not computed. Since we are working with a fixed set of distances (only a few distances are representable, and they are limited by the image size), some parameters of the slope equation are constant (they depend only on the variable x_i), besides n itself:

(2a)  K1 = Σx_i
(2b)  K2 = n·Σx_i² - (Σx_i)²

The equation for the slope then becomes:

(3)  b = (n·Σx_i·y_i - K1·Σy_i) / K2
This reduces the slope computation from four multiplications, two subtractions and one division to only two multiplications, one subtraction and one division. All the operations are performed with integer numbers. Both the sum of y_i and the sum of the products x_i·y_i are obtained in the same way as the cumulative histogram, by means of an accumulating component: an accumulator (ACC) for the sum of y_i and a multiplier-accumulator (MAC) for the sum of the products x_i·y_i. The result of these sums is sent to the MMQ_cte component, shown in figure 4.

Figure 4 - MMQ module
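A small C sketch of this constant-folded computation is shown below; the names K1, K2 and mmq_slope_q11 are illustrative only and follow equations (2a), (2b) and (3). In the hardware, the final division keeps quotient and remainder separately (the Quocient and Remain signals discussed in the simulation results).

#include <stdint.h>

/* Precomputed constants of equations (2a) and (2b): the xi (scaled log of the
 * representable radii) are fixed, so K1 and K2 can be stored beforehand, as
 * done in the MMQ_cte component. Sy and Sxy are produced by the ACC and MAC. */
typedef struct {
    int64_t k1;    /* K1 = sum of xi                      */
    int64_t k2;    /* K2 = n*sum(xi^2) - (sum(xi))^2      */
    int64_t n;     /* number of (xi, yi) pairs            */
} mmq_cte_t;

static int64_t mmq_slope_q11(const mmq_cte_t *c, int64_t sy, int64_t sxy)
{
    /* equation (3): two multiplications, one subtraction and one division,
     * all performed with integers */
    return (c->n * sxy - c->k1 * sy) / c->k2;
}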

Figure 5 shows the general scheme of the module for the real-time computation of the fractal dimension. The widths of the data buses are shown in blue and, in parentheses, the delay introduced by each module, in clock cycles.

Figure 5 - General scheme of the Fractal Dimension module

Figure 6 shows the general scheme of the fractal dimension computation module, now including the control system.

Figure 6 - Fractal Dimension module

Figure 7 shows two examples of fractals, known as Koch curves, used as tests. In a) and c) the original images are shown; in b) and d) the computed EDT is shown. The fractal in figure 7a) has fractal dimension 1.5000, and the one in figure 7c), 1.4626.

Figure 7 - Examples of Distance Transform processing

The following figures show the simulation of the architecture for the computation of the fractal dimension, in the case of the image of figure 7a. Figure 8 shows the beginning of the fractal dimension computation, after the distance transform has been computed. The generation of the memory access address (signal counter) for reading the RAM_Yi and ROM_Xi data can be observed. In the same figure, the generation of the cumulative histogram can be seen through the ACC_Yi signal.
Figure 9 shows the beginning of the generation of the signals for the MMQ computation. The signals Yi (after the logarithm computation) and Xi (after passing through the delay line) can be observed, as well as the sums Sx and Sxy. Figure 10 shows the end of the processing, with the final result of the MMQ given by the Quocient and Remain signals. Since the expected fractal dimension for lines is between 1.0000 and 2.0000, and the computation of the fractal dimension through the distance transform returns 2 - D, the final result of the division performed in the MMQ module has a quotient equal to zero and a remainder equal to 2 - D.

Figure 8 - Beginning of the processing: histogram read

Figure 9 - Beginning of the processing: generation of the sums

Figure 10 - End of the processing: generation of the slope

Table 1 below shows the average values obtained from the processing of some fractals, together with their theoretical values. Values lower than the theoretical ones were expected, since the fractal dimension values obtained by the Minkowski sausage method are lower than the theoretical values.

Curve                             Theoretical   Obtained
Line                                 1.0000       0.974
Contour of the Gosper island         1.0686       1.040
Vicsek fractal                       1.4650       1.260
Quadratic von Koch curve type 1      1.4650       1.180
Peano curve                          2.0000       1.754
Table 1 - Results obtained
4. CONCLUSION
The modules for the computation of the fractal dimension and for real-time processing were fully implemented and allow the visualization of both the distance transform and the value of the fractal dimension of the fractal. Some improvements still need to be made, such as the implementation of a binarization module for better generation of the data for the distance transform computation (which would make it possible to use the architecture for texture analysis), as well as an edge detection module, which would allow the architecture to be used to determine the roughness of object contours.
For the final results, it was possible to compute the distance transform with radius up to 35, which corresponds to 397 distinct distances. The whole system consumed 56,279 logic elements of an Altera Cyclone II 2C70 FPGA.
5. REFERENCES
[1] Mandelbrot, B. B., The Fractal Geometry of Nature, W. H. Freeman, New York, 1983.
[2] Russ, J. C., The Image Processing Handbook, 2nd ed., CRC Press, New York, 1995.
[3] Jelinek, H. F., Fernandez, E., "Neurons and fractals: how reliable and useful are calculations of fractal dimensions?", Journal of Neuroscience Methods, v. 81, pp. 9-18, 1998.
[4] Ogawa, S., Baveye, P., Boast, C. W., Parlange, J. Y., Steenhuis, T., "Surface fractal characteristics of preferential flow patterns in field soils: evaluation and effect of image processing", Geoderma, v. 88, pp. 109-136, 1999.
[5] Hyslip, J. P., Vallejo, L. E., "Fractal analysis of the roughness and size distribution of granular materials", Engineering Geology, v. 48, pp. 231-244, 1997.
[6] Allen, M., et al., "Measurement of boundary fractal dimensions: review of current techniques", Powder Technology, v. 84, n. 1, pp. 1-14, 1995.
[7] Asvestas, P., et al., "Estimation of fractal dimension of images using a fixed mass approach", Pattern Recognition Letters, v. 20, pp. 347-354, 1999.
[8] Bérubé, D., Jébrak, M., "High precision boundary fractal analysis for shape characterization", Computers & Geosciences, v. 25, pp. 1059-1071, 1999.
[9] Rosenfeld, A., Pfaltz, J. L., "Distance Functions on Digital Pictures", Pattern Recognition, v. 1, pp. 33-61, 1968.
[10] Luppe, M., Colombini, A. C., Roda, V. O., "Arquitetura para Transformada de Distância e sua Aplicação para o Cálculo da Dimensão Fractal", Terceiras Jornadas de Engenharia de Electrónica e Telecomunicações e de Computadores - JETC'05, Lisboa, Portugal, November 17-18, 2005.
AN ENTRY-LEVEL PLATFORM FOR TEACHING HIGH-PERFORMANCE
RECONFIGURABLE COMPUTING
Pablo Viana, Dario Soares, Lucas Torquato
LCCV - Campus Arapiraca
Federal University of Alagoas
ABSTRACT
Among the primary difficulties of integrating digital design prototypes into larger computing systems are the issues of hardware and software interfaces. Complete operating systems and their high-level software applications and utilities differ greatly from the low-level perspective of hardware implementations on reconfigurable platforms. Although hardware and software development tools increasingly make use of similar and integrated environments, there is still a considerable gap between programming languages running on regular high-end computers and wire-up code for configuring a hardware platform. Such a contrast makes digital design too hard to integrate into software running on regular computers. Additional issues include programming skills at different abstraction levels, costly platforms for reconfigurable computing, and the long learning curve for using special devices and design tools. Hence, coming up with an innovative high-performance reconfigurable solution, despite its attractiveness, becomes a difficult task for students and non-hardware engineers. Thus we propose a low-cost platform for attaching an FPGA device to a personal computer, enabling its user to easily learn to develop integrated hardware/software designs to accelerate algorithms for high-performance reconfigurable computing.
1. INTRODUCTION
The recent discovery of huge oil and gas volumes in the pre-salt reservoirs of Brazil's Santos and Carioca basins has fueled concern about the new challenges and dangers involved in off-shore exploration. During the last two decades, the high cost and risk involved in the activity have pushed the research community to come up with innovative solutions for building and simulating virtual prototypes of structures, under the most realistic conditions, to computationally evaluate the performance of anchors, risers (oil pipes) and underwater wells before they face the open sea.
Nowadays, the highly detailed numeric models that help engineers to develop and improve new techniques for oil and gas exploration demand a considerable computing throughput, pushing research labs in this field to invest in state-of-the-art solutions for high-performance computing. The term High-Performance Computing (HPC) is usually associated with scientific research and engineering applications, such as discrete numerical models and computational fluid dynamics. HPC refers to the use of parallel supercomputers and computer clusters, that is, computing systems comprised of multiple processors linked together in a single system with commercially available interconnects (Gigabit Ethernet, Infiniband, etc.). While a high level of technical skill is undeniably needed to assemble and use such systems, they need to be operated and programmed daily by non-computer engineers and students, who are intrinsically involved in their specific research fields.
Specialist engineers in offshore oil and gas production have developed, over the last 15-20 years, the programming skills to implement their own programs and build their prototypes using state-of-the-art programming techniques and following up-to-date rules of software engineering. They had to learn about design patterns, multi-threaded programming and good documentation practices, and started to develop many other skills for adopting open-source trends in modern cluster programming. In order to explore the processing resources even more effectively, HPC engineers are also supposed to be able to take advantage of the parallelism of graphical processing units (GPUs), which boost the processing power of most supercomputers figuring on the Top 500 List [3].
Researchers on HPC seem to be aware of the need to continuously pursue new alternatives to overcome some of the main computing challenges. Power consumption, temperature control and the space occupied by large supercomputers are the main concerns for next-generation systems [2]. In this context, the promises of reconfigurable computing seem to be suitably matched to the demand for innovative solutions for high-performance computing. Beyond the basic advantages in size and energy consumption of popular reconfigurable platforms, like Field Programmable Gate Arrays (FPGAs), it is expected that reconfigurable computing can deliver unprecedented performance increases compared to current approaches based on commercial, off-the-shelf (COTS) processors, because FPGAs are essentially parallel and might enable engineers to freely construct, modify, and propose new computer architectures. It is expected that the parallel programming paradigm shall become much more than just parallel threads running on multiple regular processors. Instead, we may expect that innovative designs may also involve the development of Processing Units (PUs) specifically designed to deal with parts of an algorithm, in order to reach the highest performance.
Hardware specialists from computer engineering schools have been extensively prepared to develop high-performance designs on state-of-the-art reconfigurable devices, such as encoders, filters, converters, etc. But paradoxically, most of the people interested in innovative high-performance computing are not necessarily hardware engineers. In contrast, these users are non-computer engineers and other scientists with real demands on high-performance computing. These professionals know their HPC needs deeply and would probably be empowered if they could wire up innovative solutions by themselves from their own desktops. This class of users would need to be capable of rapidly building and evaluating prototypes even without the intrusive interference of a hardware designer.
Since there is a considerable learning curve to master hardware design techniques and tools, it is straightforward to point out the problem of quickly training people from other knowledge areas in specific hardware design skills. In order to tackle the problem, we involved Computer Science undergraduate students to assist non-computer engineers in improving the performance of a given existing system, by mixing the legacy software code with hardware prototypes of Intellectual Property cores (IP cores). We then proposed to develop modules specifically designed to be easily attached to a regular desktop machine through a common USB (Universal Serial Bus) interface. Such an approach enables non-computer engineers to experience the reconfigurable computing benefits in their native code, by inserting calls to the remote procedures implemented in the FPGA device across the USB interface. On the other hand, the proposed platform allows computer science and engineering students to develop reconfigurable computing solutions for real-world problems. As a result, we propose an integrated platform for introducing students and engineers to high-performance reconfigurable computing.
This paper is organized as follows. Section 2 is devoted to discussing the hardware issues that motivated us to propose a simplified platform for teaching reconfigurable computing, by defining templates and a protocol to integrate logic design into a general-purpose computer system. Section 3 illustrates the utilization of the reconfigurable computing platform in engineering applications, and finally in Section 4 we discuss the achieved results, future improvements of the integrated platform and the next applications in high-performance computing.
2. LOW-COST VERSATILE RECONFIGURABLE
COMPUTING PLATFORM
2.1. Hardware Issues
Reconfigurable computing FPGA-based platforms, from distinct manufacturers of logic and third parties, are fairly available on the market. Most of them support a varied number of interfaces to connect the board with other external devices, such as network, VGA monitors and PS/2 keyboards, as well as high-performance interconnect standards such as PCI Express, Gigabit Ethernet, etc.
Basically, state-of-the-art platforms offer high-density programmable logic devices with millions of equivalent gates and support high-speed interconnects, among other facilities, giving the user of such platforms all the versatility needed to develop complex designs such as video processors, transceivers, and many other relevant projects. These platforms are suitable for small or complex prototypes, and their prices range from $500 to $5000, not including all the necessary software design tools. Although this category of platforms offers attractive support to a great variety of experiments and prototype designs, the total cost of acquiring a number of boards makes their adoption in classes prohibitive. On the other hand, there are on the market low-cost platforms, equipped with medium-density devices (around 500k equivalent gates) and priced under $200, which most schools and training centers can afford. There are, however, restrictions on the interface support offered by these platforms that may restrict their utilization to stand-alone devices.
As a participant in the Xilinx University Program (XUP), our platform (Figure 1) is based on the low-cost Xilinx Spartan 3E FPGA donation board, available at our laboratory for teaching purposes. Although this specific board contains several devices and connectors around the FPGA chip to allow students to experiment with projects integrated with network, video (VGA), keyboard, serial standard RS-232 and some other interfaces, the USB connector present on the board can only be used for programming the logic devices (FPGA and CPLD).
Initially, we tried to access the logic resources on the board from a personal computer over the network interface. The board has an RJ-45 Ethernet connector, as well as a physical-layer chip. But the user needs to implement the Data Link layer in order to provide a MAC (Media Access Control) to the network, besides the next Network Layer that implements the basic communication functionalities across the network inside the FPGA. This first attempt rapidly proved too hard to implement, since most of our students were not yet familiar with digital design.
The second alternative tried to take advantage of the 100-pin Hirose connector available on the board to implement a wide interface to an external device.
Fig. 1. FPGA Platform: Xilinx Starter Kit Spartan-3e
The external device should have a friendly interface to a personal computer through a standard USB interface version 2.0. Due to the specificities of the 100-pin connector, this hard-to-find adapter would demand a hard-to-wire device with one hundred pins to connect. Again, in order to keep our approach as simple as possible, we decided to adopt the three 6-pin Pmod connectors. The 6-pin connectors can be found on the local market and are easy to connect to a few ports of a small microcontroller with USB capabilities. Two out of the six pins are dedicated to power (+5V and GND) and the other four pins can be used for general purposes. Therefore, only 12 pins are actually available for interconnecting the board through the Pmod connectors.
We then proposed a 4-bit duplex interface based on a Microchip PIC18F4550 microcontroller, capable of transferring bytes to and from the FPGA board by dividing each byte word (8 bits) into 2 nibbles of 4 bits: four pins for writing to the FPGA, another four pins to read from the FPGA and another four pins to control the writing/reading operation. Since the whole procedure of the microcontroller to read the USB port, send to the FPGA, read from the board and finally send back over the USB takes a little less than 1 microsecond, transfers between the FPGA and the microcontroller can reach the maximum theoretical throughput of 1 MByte/s.
The microcontroller board utilized as the interface was rapidly built because we used a predefined platform, namely the USB Interface Stargate [1]. The Stargate platform is intended to help designers propose new HID (Human-Interface Device) products. The board offers analog and digital input/output pins, and is capable of easily communicating with a personal computer over USB version 2.0, making use of the native device drivers of the most popular operating systems. We simply made some minor adjustments to the Stargate's firmware to implement the defined protocol to communicate with the FPGA through the 12-pin interface.
Fig. 2. Scorpion Interface Board: modified USB Interface Stargate to interconnect USB to the FPGA.
In order to directly connect the Stargate to the FPGA board, some minor changes to its layout had to be made, eliminating the analog pins and placing the input and output pins on the same corner. The updated firmware and the proposed layout matching the physical requirements of this project motivated a new name for this specific platform: the Scorpion Interface board (Figure 2).
2.2. Defining a Communication Protocol
We proposed a simple protocol to enable communication between the FPGA board and the USB interface built into the microcontroller (PIC18F4550). The PIC firmware is intended to send and receive bytes to and from the FPGA board, according to an established protocol, exclusively defined for the purposes of this integration.
Basically, the firmware is a loop waiting for data coming from the USB interface with the PC. As soon as a byte arrives, the word is split up into the low-half and high-half nibbles (Figure 3). The less significant portion is made available at the OutPort to the FPGA and the flag bit Write-enable goes high. The data stays available until the FPGA's Write-done flag is raised. Then, the Scorpion interface board makes the second portion of bits available and resets Write-enable. The half-byte data is read by the FPGA board, which clears Write-done, ending the USB-to-FPGA writing operation.
If there is any data made available by the FPGA, the Read-enable flag is set to 1. Then, Scorpion reads the lower bits into low-half and sets Read-done high. Now the FPGA makes the higher portion of bits available at the InPort and resets Read-enable. The microcontroller on the Scorpion reads the data and concatenates both parts to recover the whole byte before sending it over the USB to the PC.
2.3. Integration Wrapper
In order to help the logic designer be more focused on the design of the processing unit itself, we proposed a parameterizable wrapper template, which hides the communication protocol between the FPGA and the Scorpion boards. The data input coming from the Scorpion board is shifted along the input operator registers (Op InReg) to become readily available to the Processing Unit (PU). The PUs must be defined as combinatorial logic.
While(true)
{
Wait (until receive from USB);
OutPort = low_half;
Write_enable = 1;
Wait (until Write_done == 1)
OutPort = high_half;
Write_enable = 0;
Wait (until Write_done == 0)
if(Read_enable == 1)
{
low_half = Read(InPort);
Read_done = 1;
Wait (until Read_enable == 0)
high_half = Read(InPort);
Read_done = 0;
}
SendToUSB(high_half+low_half);
}
Fig. 3. Protocol defined in the Scorpion interface firmware
Fig. 4. Wrapper template around the PU.
That is, the output data will be available when all the inputs are sourced at the input registers. The output register will keep the data available as long as the input data have not changed (Figure 4).
This restriction simplifies the rapid development of simple Processing Units and their respective integration into the platform. The integrated PU implemented on the FPGA becomes available to the end-user, who can immediately evaluate the hardware/software implementation and explore innovative design options.
3. INTEGRATING RECONFIGURABLE LOGIC
INTO A USER APPLICATION
On the proposed platform, integrating the traditional development on a general-purpose computer with library modules implemented in reconfigurable hardware (FPGAs) is essentially the task of including the libhid library in the C/C++ code and making use of the Application Program Interface (API) designed to send and receive bytes over the USB port (Figure 5). The API enables the use of the proposed platform, offering an abstraction layer that hides from
sendFPGA(buffer, size);
size = receiveFPGA(buffer);
Fig. 5. Basic functions to send and receive data to and from
the FPGA
the user the enumeration steps of the Scorpion interface as a device connected to a USB port.
Our illustrative example and first exercise to get started on the proposed platform requires the student to design an Adder/Subtractor in the FPGA. After having written the HDL code and synthesized the project into the FPGA, the user must send 3 bytes over the interface to the FPGA: the operation value (referring to an ADD=0x00 or SUB=0x01 operation), the first operand and the second operand. Such an operation must be carried out by enclosing all the values in a buffer and sending it to the FPGA. Next, the size returned by receiveFPGA will determine when the result is available at the buffer.
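As an illustration, a minimal host-side program for this exercise could look as follows; it only assumes the two API calls of Figure 5, and the exact argument and return types are assumptions made for this example.

#include <stdio.h>

/* Assumed prototypes of the platform API shown in Figure 5. */
void sendFPGA(unsigned char *buffer, int size);
int  receiveFPGA(unsigned char *buffer);

int main(void)
{
    unsigned char buffer[8];

    buffer[0] = 0x00;              /* operation: ADD = 0x00, SUB = 0x01 */
    buffer[1] = 17;                /* first operand                     */
    buffer[2] = 25;                /* second operand                    */
    sendFPGA(buffer, 3);           /* ship the three bytes to the FPGA  */

    int size = receiveFPGA(buffer);    /* result is ready when size > 0 */
    if (size > 0)
        printf("result = %u\n", buffer[0]);
    return 0;
}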
4. CONCLUSION
Computer Science students were able to design functional modules of processing units in VHDL, synthesize their code and configure the FPGA device using the tools from the University Program donation in the lab classes. The proposed processing units typically included functional modules such as arithmetic operators and statistical estimators. At the present time, collaborators from a partner research laboratory involved in non-computer engineering projects, such as oil and gas production, can make use of the proposed modules by attaching the proposed platform, with an FPGA pre-configured with the functional implementation of a given operator, to compare the results obtained from the hardware implementations. Although performance issues are not the major contribution of this paper, thanks to the easy-to-use proposed platform, students and engineers are increasingly considering reconfigurable computing in their academic and research projects. Our next steps include the use of high-end platforms with high-density FPGAs and high-speed interconnects to propose alternative solvers for existing scientific computing libraries.
5. REFERENCES
[1] USB Interface: http://www.usbinterface.com.br
[2] "Experimental Green IT Initiative Launches on Recycled HPC System", Scientific Computing, 2009.
[3] TOP500 Supercomputing list, available at:
http://www.top500.org/
DERIVATION OF PBKDF2 KEYS USING FPGA
Sol Pedre, Andrés Stoliar and Patricia Borensztejn
Depto. de Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires
email: spedre, astoliar, patricia@dc.uba.ar
ABSTRACT
In this paper we analyze the key derivation algorithm used to start a WPA-PSK (Wi-Fi Protected Access with Pre-Shared Key) session in order to perform a brute force attack and thus obtain the session key. We analyze its computational cost and propose improvements to the algorithm that reduce its complexity by nearly half. We also analyze which section of the algorithm would be fruitful to implement in an FPGA, and we implement it. Finally, we compare the performance of our solution on several FPGAs and with optimized implementations for current CPUs. We show that our solution is a good engineering solution in terms of cost per processed key per second when compared with current CPU performance. We also present ideas for a design that we hope could improve the performance in keys per second processed.
1. INTRODUCTION
Communication network security is a largely studied field in constant advance. As new security protocols are developed, new methods to breach that security are developed as well. A common, but computationally costly, attack is the brute force attack on the key derivation algorithms used in those protocols. Given a word dictionary and the result of the derived key, this attack consists of running the derivation algorithm on every word in the dictionary to verify its correspondence with the derived key. To try to reduce the time required for this attack, at least two steps are taken. For one, the algorithm is studied to find another, computationally cheaper one that produces the same result. On the other hand, efficient implementations are built for the hardware at hand. In our case, we will use FPGAs.
In this paper, we analyze the WPA-PSK (Wi-Fi Protected Access with Pre-Shared Key) protocol to perform a brute force attack. We will use a weakness in the application of the key derivation function PBKDF2 in this protocol that allows a significant reduction in the algorithm's complexity, and we implement part of the algorithm in an FPGA.
The rest of this paper is organized as follows: in section 2 we explain the WPA-PSK protocol and the algorithms needed to derive the session keys. In section 3 we calculate the cost of the algorithm and propose improvements. In sections 4 and 5 we explain what was implemented in the FPGA and why, and describe this implementation. Finally, in section 6 we present results on different FPGAs and a CPU, and in section 7 we draw some conclusions and present ideas for future work in order to enhance the obtained performance.
2. WPA-PSK
WPA (Wi-Fi Protected Access) is a protocol created by the
Wi-Fi Alliance to enhance the security of wireless networks
within the 802.11 standard.
Its PSK (Pre-Shared Key) operation mode was designed to provide authentication in small networks, as an alternative to the installation of an authentication server. It simply assumes that every node knows a secret passphrase of 8 to 63 ASCII printable characters (as required by the IEEE 802.11i-2004 standard). The authentication key of the network is derived from the secret passphrase and the public network's SSID (Service Set Identifier) using the PBKDF2 (Password-Based Key Derivation Function) [1] key derivation function.
The parameters of the PBKDF2 function are a password P and a salt S (both byte arrays), an integer C that defines the number of recursive applications of its underlying pseudo-random function, and an integer dkLen that indicates the length of DK, the key that will be derived. Usually, the output of the underlying pseudo-random function is shorter than dkLen. In those cases, DK is defined as the concatenation of partial applications of that underlying pseudo-random function over I different blocks. Each of these applications has the same inputs, but the salt S is redefined as (S||I) in each block.
PBKDF2 is configured in WPA-PSK to use as the underlying pseudo-random function the hash function HMAC-SHA1 (Keyed-Hashing for Message Authentication with Secure Hash Algorithm 1) [2]. The passphrase is used as the password P and the SSID as the salt S. The number of recursions C is fixed at 4096 and dkLen at 256 bits, to derive a key DK of that length. As the output of HMAC-SHA1 is 160 bits long, the number of blocks I is set to 2 (160 + 160 = 320 > 256, which is dkLen). In this manner, PBKDF2 in WPA-PSK is defined as:
DK = T_1 || T_2
T_1= U_1_1 xor U_2_1 xor ... xor U_4096_1
T_2= U_1_2 xor U_2_2 xor ... xor U_4096_2
U_1_I = HMAC-SHA1( P, S || I)
U_2_I = HMAC-SHA1( P, U_1_I)
...
U_4096_I = HMAC-SHA1( P, U_4095_I)
The previous equations show the key derivation from the private passphrase and the public SSID in wireless networks with WPA-PSK authentication; this is the complete algorithm that must be run, given a word from the dictionary and the SSID, in order to perform the brute force attack.
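For reference, the equations above correspond to the following C sketch, which assumes an HMAC-SHA1 helper with a conventional signature (such a helper is outlined after section 2.1) and an SSID of at most 32 bytes; it is a software transcription of the derivation, not the FPGA design.

#include <stdint.h>
#include <string.h>

/* Assumed helper: HMAC-SHA1 producing a 20-byte digest (not defined here). */
void hmac_sha1(const uint8_t *key, size_t key_len,
               const uint8_t *msg, size_t msg_len, uint8_t out[20]);

/* DK = T_1 || T_2, each T_i the XOR of 4096 chained HMAC-SHA1 applications. */
void wpa_psk_pbkdf2(const char *passphrase, const uint8_t *ssid,
                    size_t ssid_len, uint8_t dk[32])
{
    for (uint32_t i = 1; i <= 2; i++) {            /* blocks T_1 and T_2 */
        uint8_t salt[36], u[20], next[20], t[20];

        memcpy(salt, ssid, ssid_len);              /* S || I, I as a 32-bit int */
        salt[ssid_len] = 0; salt[ssid_len + 1] = 0;
        salt[ssid_len + 2] = 0; salt[ssid_len + 3] = (uint8_t)i;

        hmac_sha1((const uint8_t *)passphrase, strlen(passphrase),
                  salt, ssid_len + 4, u);          /* U_1_I */
        memcpy(t, u, 20);
        for (int c = 1; c < 4096; c++) {           /* U_2_I ... U_4096_I */
            hmac_sha1((const uint8_t *)passphrase, strlen(passphrase),
                      u, 20, next);
            memcpy(u, next, 20);
            for (int k = 0; k < 20; k++)
                t[k] ^= u[k];
        }
        /* DK keeps 256 bits: all 20 bytes of T_1 and the first 12 of T_2 */
        memcpy(dk + (i - 1) * 20, t, i == 1 ? 20 : 12);
    }
}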
2.1. HMAC-SHA1
HMAC (Keyed-Hashing for Message Authentication) is a mechanism to verify the authenticity of a message using a cryptographic hash function [2] [3]. This function may be SHA1 (Secure Hash Algorithm 1) [4] [5], defining the HMAC-SHA1 variant, which is defined by:
SHA1(K xor opad ||
SHA1(K xor ipad || text))
Where ipad is the 36h byte repeated 64 times and opad is the 5Ch byte repeated 64 times (i.e., both are 512 bits long). K is the key used to authenticate the message. It should be 64 bytes (512 bits) long: if it is shorter than 64 bytes it is extended with zeros. Finally, text is the text whose authenticity is being verified (it can be of any length).
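The definition above maps directly to the following C sketch, assuming a plain SHA1 helper; for simplicity it also assumes the key is at most 64 bytes and the text at most 1024 bytes, which holds for the WPA-PSK use described next.

#include <stdint.h>
#include <string.h>

/* Assumed helper: SHA1 of a byte buffer, producing a 20-byte digest. */
void sha1(const uint8_t *msg, size_t len, uint8_t out[20]);

/* HMAC-SHA1 = SHA1(K xor opad || SHA1(K xor ipad || text)), with K padded
 * with zeros to 64 bytes and ipad/opad the bytes 0x36/0x5C repeated 64 times. */
void hmac_sha1(const uint8_t *key, size_t key_len,
               const uint8_t *text, size_t text_len, uint8_t out[20])
{
    uint8_t k_ipad[64], k_opad[64], inner[20], buf[64 + 1024];

    memset(k_ipad, 0x36, 64);
    memset(k_opad, 0x5C, 64);
    for (size_t i = 0; i < key_len && i < 64; i++) {
        k_ipad[i] ^= key[i];               /* K xor ipad (K zero-extended) */
        k_opad[i] ^= key[i];               /* K xor opad                   */
    }

    memcpy(buf, k_ipad, 64);               /* inner: SHA1(K^ipad || text)  */
    memcpy(buf + 64, text, text_len);
    sha1(buf, 64 + text_len, inner);

    memcpy(buf, k_opad, 64);               /* outer: SHA1(K^opad || inner) */
    memcpy(buf + 64, inner, 20);
    sha1(buf, 64 + 20, out);
}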
2.2. HMAC-SHA1 in WPA-PSK
In the application of HMAC-SHA1 in WPA-PSK, K is the passphrase P. The message text is the SSID S concatenated with the block number I (1 or 2) in the base recursion case. In the subsequent recursion cases, text is the result of the previous execution. The application is thus defined by the following equations:
U_1_I = SHA1( P xor opad ||
SHA1( P xor ipad || (S || I)))
U_2_I = SHA1( P xor opad ||
SHA1( P xor ipad || U_1_I ))
...
U_4096_I = SHA1(P xor opad ||
SHA1(P xor ipad || U_4095_I))
2.3. SHA1
SHA1 is a cryptographic hash function that takes an input of any length, divides it into 512-bit chunks and processes them to generate a 160-bit hash [4] [5]. Figure 1 shows an iteration of the main loop of SHA1. This iteration repeats 80 times for each 512-bit chunk. A, B, C, D and E are 32-bit words of the state, F is a non-linear function that varies according to the iteration number t, W is an 80-word array constructed from the original message, and K is a constant that depends on the iteration t. The boxed plus represents addition truncated to 32 bits (addition modulo 2^32) and <<<n is an n-bit left rotation.
Fig. 1. An iteration of the main loop of SHA1
As SHA1 will be the core of the FPGA implementation
we show its pseudo-code:
1 Initialize variables h0,h1,h2,h3,h4
2 Extend the message to a multiple of 512 bits and
  divide it into 512-bit chunks
3 For each chunk
  - Divide it into 16 32-bit words W[t] (0<=t<=15)
  - Extend those 16 words to 80:
    for (t=16; t<=79; t++)
      W[t]=(W[t-3] xor W[t-8] xor W[t-14]
            xor W[t-16]) rol 1
  - Initialize the hash for this chunk
    A = h0; B = h1; C = h2; D = h3; E = h4
  - Main loop
    for (t=0; t<=79; t++)
      if 0 <= t <= 19 then
        F = (B and C) or ((not B) and D)
        K = 0x5A827999
      else if 20 <= t <= 39
        F = B xor C xor D
        K = 0x6ED9EBA1
      else if 40 <= t <= 59
        F = (B and C) or (B and D) or (C and D)
        K = 0x8F1BBCDC
      else if 60 <= t <= 79
        F = B xor C xor D
        K = 0xCA62C1D6
      TEMP = (A rol 5) + F + E + K + W[t]
      E=D; D=C; C=B rol 30; B=A; A=TEMP
  - Add this chunk's hash to the total
    h0+=A; h1+=B; h2+=C; h3+=D; h4+=E
4 Produce the final hash
  hash = h0 || h1 || h2 || h3 || h4
SHA1 doesn't have configuration parameters. Its application in WPA-PSK is no different from its execution in any other context.
3. ALGORITHM COST REDUCTIONS
In this section we describe the improvements made to the algorithm prior to its hardware implementation.
3.1. Metrics
In order to quantify the improvements, we need a metric for the computational cost. The primitive functions used to derive the key with the PBKDF2 function are three: concatenation (||), logic xor and SHA1. From these, the most costly is SHA1, and we will use the number of 512-bit chunks it processes as the metric.
3.2. Cost
We will first analyze the cost of one HMAC-SHA1 iteration, which has two applications of SHA1 (see section 2.2). In the inner application, in the base case, SHA1 is applied to (P xor ipad || (S||I)). About this application we can state:
(S||I) is of variable length and shorter than 512 bits: the length of the SSID S is between 1 and 32 bytes and the block number I is always expressed as a 32-bit integer, so the length of (S||I) is between 40 and 288 bits.
The length of (P xor ipad) is 512 bits, by definition of ipad in HMAC (see 2.1).
Therefore, (P xor ipad || (S||I)) is always shorter than 1024 bits and longer than 512 bits, which means that in this application SHA1 runs over two 512-bit chunks.
The same analysis holds for the inner applications in the recursive cases (P xor ipad || U_n_I), because (U_n_I) is 160 bits long (it is the output of a previous SHA1) and therefore the whole chain is 512 + 160 = 672 bits long. This means that in all the inner applications SHA1 runs over two 512-bit chunks.
In the outer application of SHA1 in the algorithm, since SHA1(x) is always 160 bits long, the length of (P xor opad || SHA1(x)) is always 672 bits. Therefore, in SHA1(P xor opad || SHA1(x)) the function SHA1 always runs over two 512-bit chunks.
In conclusion, in each step of the calculation described in section 2.2 there are 4 applications of SHA1 over 512-bit chunks. As there are 4096 recursions of HMAC-SHA1 for each one of the two blocks, we have a total cost of 4096 × 4 × 2 = 32,768 SHA1 runs over 512-bit chunks.
3.3. Reduction
The chain (P xor ipad) is the first 512-bit chunk that SHA1 processes in the inner call of HMAC-SHA1. As P and ipad are constant during the whole process, (P xor ipad) remains constant during the 4096 inner calls of SHA1 in HMAC-SHA1. Therefore, it may be preprocessed to create an intermediate result of 160 bits (a state of SHA1 between the 512-bit chunks) and be a parameter of a new SHA1p function that continues the execution of SHA1 with the following 512-bit chunk.
Exactly the same happens with the outer call of SHA1 in HMAC-SHA1. The chain (P xor opad) is also 512 bits long and constant in all the intermediate applications of SHA1, and may be pre-calculated.
In this manner, the four applications of SHA1 on 512-bit chunks needed to perform one step of the algorithm described in section 2.2 are reduced to only two applications. There are two additional SHA1 runs to pre-calculate SHA1(P xor ipad) and SHA1(P xor opad). The new total is then 4096 × 2 × 2 + 2 = 16,386 runs, reducing the complexity to nearly half of the original 32,768 SHA1 runs.
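The reduction can be pictured with the following C sketch of the partial function SHA1p, assuming a sha1_compress() primitive that applies the 80-round compression of one 512-bit block to a 160-bit state (the function names are illustrative). The midstates for (P xor ipad) and (P xor opad) are computed once and then reused, so each HMAC-SHA1 step costs only two compressions.

#include <stdint.h>
#include <string.h>

/* Assumed primitive: one SHA1 compression of a single 512-bit block,
 * updating the 160-bit chaining state in place (not defined here). */
void sha1_compress(uint32_t state[5], const uint8_t block[64]);

/* SHA1p: continue SHA1 from a precomputed midstate over the final block,
 * which holds a 160-bit input plus SHA1 padding and the total length of
 * 512 + 160 = 672 bits, exactly as analyzed in section 3.2. */
static void sha1p(const uint32_t midstate[5], const uint8_t data20[20],
                  uint32_t out[5])
{
    uint8_t block[64] = {0};

    memcpy(block, data20, 20);
    block[20] = 0x80;                       /* padding bit                    */
    block[62] = 0x02;                       /* 672 = 0x2A0, big-endian length */
    block[63] = 0xA0;

    memcpy(out, midstate, 5 * sizeof out[0]);
    sha1_compress(out, block);              /* out now holds the 160-bit hash */
}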
4. HARDWARE MIGRATION
With the intention of replacing the CPU's processing, we designed an electronic circuit to be instantiated in a reconfigurable hardware device such as an FPGA.
The first step of this design consists of selecting exactly which part of the original algorithm will be implemented, given the finite resources of the FPGA and taking into account the complexity of its programming and verification.
As we have already analyzed, the principal component of the code is the SHA1 algorithm. We showed in 3.3 that in 16,384 of the 16,384 + 2 runs we execute a partial implementation of SHA1 that takes as input 160 bits of preprocessed state (that is, SHA1(P xor ipad) or SHA1(P xor opad)) and another 160 new bits that must be processed. Therefore, we chose this part of the problem to be implemented in hardware.
Another important consideration is the bandwidth necessary to transfer the partial results between the CPU and the circuit instantiated in the FPGA. In our case we use a 100 Mbit/s Ethernet connection. If we chose to implement only the partial SHA1, any gain in processing time that the circuit could provide would be too small compared to the time needed to transmit the 160+160 bits necessary for each execution. As PBKDF2 executes 4095 recursive calls in each block, if that control is also implemented in the FPGA, the traffic is considerably reduced:
1. CPU → FPGA:
(a) preprocessed SHA1(P xor ipad) (160 bits)
(b) preprocessed SHA1(P xor opad) (160 bits)
(c) result of the base case of the first block: U_1_1 (160 bits)
(d) result of the base case of the second block: U_1_2 (160 bits)
(e) Total: 160 × 4 = 640 bits for the whole process execution.
2. FPGA → CPU:
(a) result T_1 (160 bits)
(b) result T_2 (160 bits)
(c) Total: 160 × 2 = 320 bits for the whole process execution.
The time needed to transfer these 960 bits over the 100 Mbit/s Ethernet connection is negligible compared to the gain in processing time. In conclusion, we implement in hardware the 4095 × 2 recursive calls to the partial SHA1 that takes as inputs the 160 preprocessed bits and processes the remaining 160 bits.
5. HARDWARE IMPLEMENTATION
In figure 2 we show the state machine implemented in the FPGA.
Fig. 2. FSM implemented in the FPGA
The Load and Prefetch states are in charge of loading the necessary operands for the FPGA processing. The following states implement both recursive blocks of 4095 SHA1 runs. The Unload state transmits the final results back to the CPU.
As shown in section 2.2, the two blocks of 4095 applications of HMAC-SHA1 do not have data dependencies and thus may be executed in parallel. However, within each block I, every recursive application U_n_I depends on the result of the previous application, and therefore it may only be executed sequentially. That is why the state Process1 corresponds to the inner SHA1 application for both blocks simultaneously, and the state Process2 corresponds to the outer application for both blocks as well. The preprocessing states initialize the variables needed for the application of SHA1 (W, A, B, C, D and E).
5.1. Implementation of the Process states
The core of these states is the implementation of the main loop of SHA1 shown in the pseudo-code in section 2.3. All the operations in this loop are done in parallel in one clock cycle, and the state is therefore executed 80 times. Herein lies the advantage of the hardware implementation.
Figure 3 shows the registers and logic implemented to solve all the operations in one clock cycle. The rectangles represent 32-bit wide registers, and the circles represent combinational logic sections. In each clock cycle, the data flows from the registers, feeds the combinational logic and the results are stored back in the registers, following the arrows. All the arrows represent 32-bit wide data exchanges except those of the control registers t 80 and t 16, as explained in the following sections.
Fig. 3. Implementation diagram for states Process 1 and
Process 2
The circles calc f and calc k correspond to the combinational calculation of F and K. To calculate TEMP we simply add up the results of the previous calculations, as shown in figure 3. On the left side of this figure, the connections between the state registers A, B, C, D and E are shown, which update those registers as in the pseudo-code of section 2.3.
5.1.1. Implementation of array W
The implementation of array W deserves special mention. It is shown on the right side of figure 3. In the original algorithm, W has 80 positions, each one 32 bits wide. In the first 16 iterations, the values are loaded with the message being processed, and then in each iteration t the corresponding position W(t) is calculated using four values of W that depend on t.
As in each iteration only the previous 16 positions of W are used, a first improvement is to store only 16 positions and calculate W(t%16) in each iteration. Such an implementation in an FPGA requires four 32-bit 16-to-1 multiplexers to select the four positions needed to calculate the current W(t%16), and one extra multiplexer to select the position in W where the result must be stored. These are 5 very large multiplexers, with a very large area cost and an increase in the delay of the calculations.
Our implementation takes advantage of the fact that the same positions relative to t are always used. We implemented array W as a stack, where position W(j+1) is moved to position W(j) in each clock cycle. This has a minimal cost in hardware and completely eliminates the multiplexers, because the same positions are always used in the calculations (0, 2, 8 and 13) and the result is always pushed onto the stack.
The combinational logic needed for the calculation of W(t%16) is in the circle calc w(t), which feeds the stack W and is also used in the calculation of TEMP.
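In C, the scheme can be sketched as follows: only 16 words are kept and shifted every round, so the taps always sit at the fixed positions 0, 2, 8 and 13, which in hardware removes the wide multiplexers that a W[t % 16] indexing scheme would require (the function name is illustrative).

#include <stdint.h>

static inline uint32_t rol1(uint32_t x) { return (x << 1) | (x >> 31); }

/* One step of the W "stack": w[0] holds the oldest of the last 16 words, so
 * W[t-16], W[t-14], W[t-8] and W[t-3] are always at positions 0, 2, 8 and 13. */
static uint32_t w_next(uint32_t w[16])
{
    uint32_t nw = rol1(w[13] ^ w[8] ^ w[2] ^ w[0]);
    for (int i = 0; i < 15; i++)     /* shift: w[j] <- w[j+1] */
        w[i] = w[i + 1];
    w[15] = nw;                      /* push the new word onto the stack */
    return nw;
}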
5.1.2. Calculation of k, f and W(t%16)
As shown in the pseudo-code in section 2.3, the calculation of k and f depends on whether t is smaller than 20, 40, 60 or 80. The implementation inferred from the algorithm is to use a counter and several comparators to select the proper function. Our implementation is better, since it uses less area and eliminates the comparison time, thus reducing the delay in each clock cycle.
We use two shift registers: one 80 bits wide, initialized with 40 set bits followed by 40 cleared bits; and one 40 bits wide, initialized with 20 set bits and 20 cleared bits. In each clock cycle we take the first bit of both registers, resulting in the pair 00 during the first 20 clocks (corresponding to 0 <= t <= 19), the pair 01 during the next 20 clocks (corresponding to 20 <= t <= 39), then the pair 10 and finally the pair 11. These bits select the correct output of a multiplexer in the logic of calc k and calc f, as shown in figures 4 and 5.
Fig. 4. Implementation of k calculation
Fig. 5. Implementation of F calculation
In the calculation of W(t) a comparison is also needed: whether t is smaller than 16. In this case, we use the same idea as in the previous comparisons. We initialize a shift register with 16 set bits, and in each clock cycle we feed it a cleared bit. During the first 16 clocks, the first bit of the shift register is set, and during the remaining clocks of the 80 it is cleared. This bit selects the correct answer in a multiplexer, as shown in figure 6.
Fig. 6. Implementation of W(t%16) calculation
6. RESULTS
To compare the performance of the solution, we synthesized it for several FPGAs and compared the times with those obtained with aircrack-ng [6] on a Core 2 Duo. It is important to notice that the aircrack-ng code is highly optimized for the Core 2 Duo processor: the SHA1 kernel is programmed in assembler and takes full advantage of the SIMD (Single Instruction Multiple Data) instructions available in the IA-32 and IA-64 architectures. In this way, it processes 4 keys in parallel in each core, taking full advantage of the hardware capacities of the processor.
To run the tests we used several FPGAs from Xilinx [7]. We conducted simulation tests with several Spartan 3A and Virtex 4 devices. The results are shown in table 1, comparing the device clock frequency, the number of keys per second processed, how many keys are processed in parallel, the cost in dollars, and finally the cost in dollars per key per second processed.
Device          clk (MHz)   Keys/s   Keys in parallel   Cost (USD)   $/(key/s)
Spartan 3A -4       81        120            4                60        0.12
Spartan 3A -5       95        141            4               111        0.2
Virtex 4 -10       134        200           33              8076        1.23
Virtex 4 -11       155        231           33             10340        1.36
Virtex 4 -12       180        268           25              7737        1.15
Core 2 Duo        1500        150            8               210        0.7
Table 1. Performance comparison
As we can see in table 1, the achieved results in terms of keys per second processed are similar between the different Spartan 3A devices and the Core 2 Duo, while the Virtex achieved better results. We did not obtain a significant improvement in keys per second processed, probably because these algorithms are designed so that the data dependency is high and parallel implementations are hard to obtain, making the brute force attack more costly.
On the other hand, when comparing the price per key, significant improvements are obtained in the Spartan 3A family. We conclude that this is a good solution from the engineering point of view, even more so if we take into account not only the prices of the chips (as in table 1) but also the price of the electronics needed for those chips to work. Given that the electronics needed for a processor to work is far more extensive than that needed for an FPGA, the gap broadens significantly.
7. CONCLUSIONS AND WORK IN PROGRESS
In this paper we presented a hardware implementation using an FPGA for the brute force attack on the key derivation algorithm used in WPA-PSK. We performed an analysis of the algorithm and found an improvement that reduces its computational cost by half. We then analyzed which part of the algorithm should be implemented in the FPGA. We implemented the algorithm and optimized some parts of it specifically for the target FPGA architecture. Finally, we compared the achieved performance of the implementation on several FPGAs and on a Core 2 Duo, using code specially optimized for that processor.
The results indicate that the FPGA implementation of the optimized version of the algorithm is a good engineering solution, given that the achieved price per key per second is much smaller than on a CPU.
7.1. Work in progress
Seeking to improve the number of keys per second, we are developing an alternative implementation.
In the implementation explained in this work, we focused on performing the greatest possible number of operations in parallel. That is why all the operations within the main loop are done in parallel. In this way, the resulting clock frequency is low, given that there is a great number of chained operations, and the area occupied for the processing of one key is large, reducing the number of keys that can be processed in one chip.
The idea of this new implementation is quite the opposite: sequentially execute simple instructions to achieve a high clock frequency and reduce the logic needed to process one key, so that many keys can be processed in parallel in one chip. The idea is to implement a simple ALU, specific to this problem, that contains only the operations needed to execute SHA1. In this way, an assembly program for this ALU can be kept in a block RAM that controls all the implemented ALUs at once, and the state of the key derivation can be kept in a block RAM for each available ALU. This idea is still in the design stage, so we do not yet have an estimate of its possible performance.
8. REFERENCES
[1] B. Kaliski, RFC 2898, pp. 1-34, Sept. 2000.
[2] M. Bellare, R. Canetti, and H. Krawczyk, "Keying hash functions
for message authentication," Lecture Notes in Computer
Science, vol. 1109, pp. 1-15, 1996.
[3] M. Bellare, R. Canetti, and H. Krawczyk, RFC 2104, pp.
1-11, Feb. 1997.
[4] D. Eastlake and P. Jones, RFC 3174, pp. 1-22, Sept. 2001.
[5] National Institute of Standards and Technology, USA, "Se-
cure Hash Standard," Federal Information Processing Standard
(FIPS) 180-1, April 1993.
[6] www.aircrack-ng.org
[7] www.xilinx.com
AUTOMATIC SYNTHESIS OF SYNCHRONOUS CONTROLLERS
WITH LOW ACTIVITY OF THE CLOCK
Jozias Del Rios*, Duarte L. Oliveira
Divisão de Engenharia Eletrônica
Instituto Tecnológico de Aeronáutica - ITA
São José dos Campos - São Paulo - Brazil
email: joziasdelrios@gmail.com, duarte@ita.br
Leonardo Romano†
Departamento de Engenharia Elétrica
Centro Universitário da FEI
São Bernardo do Campo - São Paulo - Brazil
email: leoroma@uol.com.br
ABSTRACT
In a digital system the activity of the clock signal is a major
consumer of energy, accounting for 15% to 45% of the total
energy consumed. In this article we propose a method for the
automatic synthesis of synchronous controllers with low
activity of the clock signal. The reduction of clock activity is
obtained by applying two strategies. In the first
strategy, our controllers operate on both
edges of the clock signal. This allows a 50% reduction in the
clock frequency while keeping the same processing
time. An important feature is that our controllers use only
flip-flops that are sensitive to a single edge of the clock
signal (single-edge triggered, SET-FF). In the second
strategy, the clock signal is inhibited in our controllers
when a state with a self-loop is encountered.
1. INTRODUCTION
With the evolution of microelectronics, more and more
high-complexity digital systems are being designed. A
common characteristic of these systems is that they
are battery-fed and are conceived for different applications
such as wireless communication, portable computers,
aerospace (satellites, missiles, etc.), aviation, automotive and
medical applications. Since they are battery-fed, it is
desirable that the batteries have a long life span and,
therefore, power dissipation is a very important parameter
in the design of such systems [1]. These systems may be
implemented in VLSI technology and/or FPGAs (Field
Programmable Gate Arrays). FPGAs have become a
popular means to implement digital circuits. FPGA
technology has grown considerably in the past few years,
producing FPGAs with up to 50 million gates and allowing
complex digital systems to be programmed in
such devices [2].
Traditionally, digital circuits are implemented with
components built in CMOS technology. The power
dissipated in CMOS components follows the expression [3]:

P_{TA} = \tfrac{1}{2}\, C\, V_{DD}^{2}\, f\, N \;+\; Q_{SC}\, V_{DD}\, f\, N \;+\; I_{leakage}\, V_{DD}    (1a)

where P_{TA} is the total average dissipated power, V_{DD} is the
supply voltage, f is the operating frequency, the factor N is the
switching activity, that is, the number of transitions at a
gate output, and Q_{SC} and C are, respectively, the
short-circuit charge and the capacitance [3].

In equation (1a), the first term represents the dynamic
dissipated power. The second term represents the power
dissipated due to the short-circuit current. The third term represents the
static dissipated power related to leakage current. In static
CMOS technology, the largest fraction of the dissipated power
occurs during switching events (dynamic power) [3].
The average power dissipation at a gate g may be simplified to
the first term of (1a):

P_{AVERAGE\text{-}g} = \tfrac{1}{2}\, C_g\, V_{DD}^{2}\, f\, N_{AVERAGE\text{-}g}    (1b)
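As a quick numeric illustration of (1b), with assumed values that are not taken from the paper:

% Illustrative only: all numbers below are assumptions.
P_{AVERAGE\text{-}g} = \tfrac{1}{2}\,(10\,\mathrm{fF})\,(1.2\,\mathrm{V})^{2}\,(100\,\mathrm{MHz})\,(0.2) \approx 0.14\,\mu\mathrm{W},

which shows how both the clock frequency f and the switching activity N enter the dynamic power linearly; halving either one halves the dissipated power of the gate.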

The synthesis of finite state machines (FSM) plays an
important role in the design of battery-fed digital circuits.
Many digital circuits are described by an architecture
consisting of a network of controllers + data-paths and/or
processors [4]. The synchronous controllers of such circuits
are often specified as an FSM consisting of several states
and transitions between states, i.e., they are specified by a
state transition graph (STG).
The techniques for reducing dynamic power are applied at
different levels of digital design [1]. In the synthesis of
synchronous controllers, solutions for reducing
power have been proposed at the logic level: 1) clock logic
control (gated clock) [5],[6]; 2) flip-flops triggered at both
edges of the clock transition [7],[8],[9]; 3)
decomposition [10]; 4) state assignment [11]; 5) logic
minimization [12].
In a digital system, the sequential part is the main
contributor to power dissipation. Recent studies have
shown that in such systems the clock consumes a large
percentage (15% to 45%) of the system's power [5]. Therefore,
the dissipated power of a circuit may be considerably reduced if
clock activity is reduced. Among the solutions proposed for
reducing the dynamic dissipated power of synchronous
controllers, the first two are very interesting.
* Student of Electronic Engineering at ITA.
† Candidate for Master of Science at ITA.
1.1. Reduction in the activity of the clock
The first strategy (gated clock) uses additional logic to
inhibit (stop) the clock signal in states with a self-loop. For
some FSMs, most clock cycles are spent in states with a self-
loop. In these states there is still dynamic power
dissipation, because internally the flip-flops (FFs) keep
switching, although no changes occur at the FF outputs.
Benini et al. [5] proposed a method and target architecture
for synchronous controllers with clock inhibition. These
controllers operate on a single edge of the clock signal.
The second strategy is to use, in the controllers, FFs that
are sensitive to both the rising and falling edges of the clock signal
(double-edge triggered - DET) [9]. For the same data
throughput, DET-FFs require only 50% of the
clock frequency, which reduces the activity of the clock.
Comparing DET-FFs with FFs operating on a single
clock edge (single-edge triggered - SET), an increase in
power consumption and area (number of transistors) is observed [8].

A promising approach to reduce the activity of the
clock signal is the union of the two strategies: gated clock +
FSM using DET-FFs. This approach has two problems: 1)
most VLSI standard-cell libraries do not include this kind
of FF (DET), and the macro-cells in FPGAs use SET D-FFs
[13]; 2) for the architecture proposed in [5], clock inhibition
in states with a self-loop is sub-optimal when applied to
Moore FSMs that use DET-FFs.

In this paper we propose a tool for the automatic synthesis of
Moore-model synchronous controllers. Our method
drastically reduces the activity of the clock signal. This
reduction is achieved through two strategies. In the first
strategy, our method synthesizes synchronous controllers
that operate on both edges of the clock signal, but uses
only SET D-FFs. In the second strategy, we
propose a new architecture that inhibits the clock signal, on
both edges of the clock transition, in states with a
self-loop. Figure 1 shows the Moore-model target architecture
used to implement our synchronous controllers.
[Figure 1: block diagram. The inputs and the state variables feed two excitation-logic blocks and an inhibition-logic block; the inhibition logic (signal h) gates the clock CLK through two latches, producing Gclk1 and Gclk2, which drive two banks of D flip-flops; the state variables from both banks feed the output logic that generates the outputs.]

Fig. 1. Target architecture proposal: Moore model.
The remainder of this paper is organized as follows. Section 2
presents some concepts needed to understand our
method; Section 3 introduces our method; Section 4
illustrates our method with an example from the literature;
Section 5 discusses the advantages and limitations of our
method and some results; and finally Section 6 presents
our conclusions and future work.
2. PRELIMINARIES
A synchronous controller is a deterministic FSM of Moore
or Mealy type whose behavior is described by an STG.
The vertices represent states and the arcs represent state transitions. The
main idea of our method is to partition the Moore STG into
two subsets of state transitions such that each subset is
associated with one bank of D-FFs. One bank of
SET D-FFs operates on the rising edge (CLK+) of the clock
signal and the other bank on the falling edge
(CLK-). The partitioning is achieved by constructing a
graph called a clock transition graph (CTG), proposed in
[14].
2.1. Clock transition graph
In this section we present the CTG and the concepts for its
manipulation. In the CTG, the state transitions between any
two states of the STG are represented as a bridge, because the
connection is not directional.

Definition 2.1. A Clock Transition Graph (CTG) is an
undirected graph <V, A, S>, where V is the set of vertices
that describe states, A is the set of edges that describe
bridges and are labeled with the clock signal polarity
in {+, -}, and S is the initial state.

Figures 2a,b show, respectively, an STG with inputs and
outputs omitted and its CTG without labels (clock signal).
Figure 2a shows six state transitions. Figure 2b shows four
bridges. The self-loop in state B is not a bridge, and the two
state transitions between states C and D form a single
bridge (C-D). The interconnection of a set of
bridges forms a path. If a connection of bridges begins
and finishes in the same state, it defines a cycle. Figure 2b
shows the cycle {(A-B), (B-C), (C-D), (D-A)}. A
bridge is defined as positive if it is associated with the rising
edge of the clock (CLK+); if it is associated with the falling edge
of the clock signal, it is defined as negative (CLK-). A cycle is
called degenerate if all the bridges belonging to it
are of the same type, either positive or negative. A cycle
is even if it has an even number of bridges; otherwise the
cycle is odd. The CTG is called degenerate if all its bridges
are of the same type. The SFSM is degenerate if its CTG
is degenerate. Figures 3a,b show, respectively, an even
degenerate cycle and an even non-degenerate cycle.
In Figure 3a the CTG is degenerate.
Fig. 2. Specification: a) STG omitting inputs and outputs; b) CTG omitting the clock signal.
Fig. 3. CTG: a) even degenerate cycle; b) even non-degenerate cycle.
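A minimal sketch of the CTG construction just described (our own illustration, not the authors' tool; the STG is assumed to be given as a list of directed transitions): self-loops are dropped and the pair of directed transitions between two states collapses into a single undirected bridge.

# Illustrative sketch of building a CTG from an STG.
def build_ctg(stg_transitions):
    bridges = set()
    for src, dst in stg_transitions:
        if src == dst:
            continue                         # a self-loop is not a bridge
        bridges.add(frozenset((src, dst)))   # C->D and D->C form one bridge
    return bridges

# STG of Fig. 2a: six transitions, including the self-loop on B and the
# two transitions between C and D.
stg = [('A', 'B'), ('B', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'C'), ('D', 'A')]
print(len(build_ctg(stg)))   # 4 bridges, as in Fig. 2b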
3. AUTOMATIC SYNTHESIS: METHOD
Our controller is an incompletely specified Moore-type FSM.
Our tool synthesizes the controller using D flip-flops.
It follows the traditional procedure and has five steps:
1. Capture the description of the controller as a Moore
STG. The tool accepts the STG description either
in the kiss2 format [15] or in the proposed format
called EMS (Explicit Machine Specification) [16].
2. Perform state minimization of the STG using the
partitioning algorithm and obtain STG_MIN [15]. In this
step the partitioning algorithm, which is suited to
completely specified machines, is
modified to accept incomplete specifications. The
modification uses a heuristic to specify the don't-care
outputs and states.
3. From STG_MIN, build the CTG (Section 3.1).
4. Using the CTG, encode STG_MIN with reduced
switching, obtaining STG_MIN-COD (Section 3.2).
5. From STG_MIN-COD, using the Espresso algorithm [15],
extract the excitation, output and inhibition equations
in sum-of-products form.
3.1. Generation of the CTG
The CTG is obtained from STG_MIN. Our algorithm is
composed of three steps [16]:
1. Generate the CTG and extract the set of bridges and
cycles.
2. Adjust the parity of each cycle.
3. Define the label (type) of each bridge.
The algorithm seeks to define in the CTG an alternation in
the type of the bridges, and the definition of the bridge types in
a cycle must satisfy Lemma 3.1.
Lemma 3.1 (without proof). Let E = {A, B, C, ..., Z} be the set of
states of the CTG and Cy any cycle of the CTG, where
Cy = {(A-B), (B-C), ..., (X-A)}, and (D-K) in Cy a bridge of
the CTG. The CTG cannot be encoded in the non-
degenerate form if and only if in Cy there is a unique path of
positive or negative bridges.

In the first step, the algorithm extracts the ordered, unlabeled
bridges and all cycles of the CTG. This task is performed by
a depth-first traversal of STG_MIN. The second step verifies
the parity of each cycle. For cycles with an odd number of
bridges, our method optionally adjusts these cycles to an
even number of bridges. The adjustment is accomplished by
introducing a NOP state. The advantage of balanced
partitioning is that it allows a greater reduction in the activity of
the clock signal and in the number of state variables. Figure
4a shows a CTG with a cycle with an odd number of bridges. Figure 4b
shows the CTG with the parity adjusted by introducing
the NOP state. The third step of the algorithm defines
(labels) the type of each bridge. If the bridge is negative,
the label is CLK-; if the bridge is positive, the
label is CLK+. Figure 3a does not satisfy Lemma 3.1,
because there is only a unique path of positive bridges.
Figures 5a,b show, respectively, unbalanced and balanced
cycles. They satisfy Lemma 3.1 because there are at
least two paths of positive and negative bridges. The
introduction of NOP states in a CTG is an alternative way to
satisfy Lemma 3.1.
Fig. 4. CTG: a) cycle with an odd number of bridges; b) parity adjustment of the cycle.

Fig. 5. CTG: a) unbalanced cycle; b) balanced cycle.
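The sketch below illustrates steps 2 and 3 above (our own simplified illustration; the tool's actual algorithm operates on the full set of extracted cycles): an odd cycle is made even by inserting a NOP state, and the bridges along the cycle are then labeled with alternating CLK+ / CLK- types so that no cycle becomes degenerate.

# Simplified illustration of parity adjustment and bridge labeling.
def adjust_and_label(cycle_states):
    states = list(cycle_states)
    if len(states) % 2 == 1:              # odd number of bridges in the cycle
        states.append('NOP')              # insert a NOP state to make it even
    labels = {}
    for i, s in enumerate(states):
        nxt = states[(i + 1) % len(states)]
        # alternate the bridge type along the cycle
        labels[(s, nxt)] = 'CLK+' if i % 2 == 0 else 'CLK-'
    return labels

# Cycle of Fig. 4a: five states A..E, hence an odd number of bridges.
print(adjust_and_label(['A', 'B', 'C', 'D', 'E']))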
3.2. State assignment
In this section we describe the proposed algorithm for encoding
the CTG. The encoding is performed in three steps [16]:
1. Symbolic coding
2. Code reduction
3. Binary encoding
3.2.1. Symbolic coding
The target architecture shown in Fig. 1 is characterized by
the partitioning of the FFs into two banks. One bank operates
the FFs on the rising edge of the clock (B_CLK+); the other bank
operates on the falling edge of the clock (B_CLK-). In the first
step the proposed algorithm symbolically encodes the CTG.
Each state of the CTG is symbolically coded, and each symbolic
code is formed by two semi-codes, one related to each bank of FFs,
where each semi-code is an integer value. The code must satisfy
the code partition rule (rule_pc) below, in which each state is
encoded symbolically with two concatenated semi-codes.

Rule_pc: Let Pj be the bridge between states A and B of the
CTG, and let the semi-codes Sci, Sck belong to B_CLK+ and Scx, Scy
belong to B_CLK-, where each Sc is an integer value. If Pj is negative,
then states A and B must have the same B_CLK+ semi-code,
for example: A = Sci&Scx and B = Sci&Scy. If Pj
is positive, then states A and B must have the same B_CLK-
semi-code, for example: A = Sci&Scx and B = Sck&Scx.

Figures 6a,b show the CTG encoded symbolically; the encoding
in Fig. 6a satisfies rule_pc, while that in Fig. 6b does not,
because the transition B-C violates rule_pc. In Fig. 6a the
positive and negative symbolic codes are, respectively,
[0,1,2] and [0,1,2].
Fig. 6. CTG symbolically encoded: a) satisfies rule_pc; b) does not satisfy rule_pc.
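A small checker for rule_pc as we read it from the definition above (our own illustration; a state code is written here as a pair of semi-codes, one for bank B_CLK+ and one for bank B_CLK-): a negative bridge requires its two states to share the B_CLK+ semi-code, and a positive bridge requires them to share the B_CLK- semi-code.

# rule_pc checker (illustration only).
def satisfies_rule_pc(bridges, codes):
    for (a, b), label in bridges.items():
        pos_a, neg_a = codes[a]
        pos_b, neg_b = codes[b]
        if label == 'CLK-' and pos_a != pos_b:   # negative bridge: same B_CLK+ semi-code
            return False
        if label == 'CLK+' and neg_a != neg_b:   # positive bridge: same B_CLK- semi-code
            return False
    return True

# Fragment of Fig. 6: bridge B-C is a negative (CLK-) bridge.
bridges = {('B', 'C'): 'CLK-'}
print(satisfies_rule_pc(bridges, {'B': (1, 0), 'C': (1, 1)}))  # Fig. 6a: True
print(satisfies_rule_pc(bridges, {'B': (1, 0), 'C': (0, 1)}))  # Fig. 6b: False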
3.2.2. Code reduction
The state-variable reduction algorithm performs the merging
of semi-codes [16]. The algorithm generates all
combinations of semi-code merges (brute force). Figures
7a,b,c show the maps of semi-codes.
Fig. 7. Maps of semi-codes: a) initial; b) merging of positive bridges; c) merging of negative bridges.

(1) The symbol & means concatenation.
3.2.3. Binary encoding
The proposed algorithm for the binary encoding of the CTG uses
the concepts of inversion masks and snake-sequence [16].
The encoding is performed in two steps: the first step handles
bank B_CLK+ and the second handles bank B_CLK-.
Figures 8a,b show the encoded CTG.

Fig. 8. CTG encoded: a) three variables; b) four variables.
4. CASE STUDY
In this section we illustrate our method with the seven.ems
benchmark (Fig. 9). The SYntool_DET tool reads the
specification in the kiss2 or EMS format. The
second step is state minimization; states D and E were
merged. The third step generates the CTG; the cycle-parity
adjustment option was used, and the NOP state was
inserted between states H and A (Fig. 10); the CTG is then
labeled with the clock signal (CLK) (Fig. 11). In the fourth
step the symbolic encoding of the CTG is performed (Fig. 12).
The resulting symbolic encoding requires three codes [0,1,2]
in B_CLK+ and four codes [0,1,2,3] in B_CLK-.
The next step is code reduction: code 2 was eliminated from
B_CLK+, which is therefore covered by two codes [0,1]
(see Figs. 13, 14). In the last stage of the
fourth step the binary encoding is performed. Figure 15 shows
the encoded CTG. The encoding requires three state
variables: two state variables for bank B_CLK- and
one variable for bank B_CLK+. The fifth and last step
is logic minimization. Figures 16, 17 and 18 show the resulting
logic.
Fig. 9. STG: benchmark seven.ems.
Fig. 10. CTG_MIN-CP: benchmark seven.ems with parity adjustment.
Fig. 11. Labeled CTG.
Fig. 12. CTG symbolically encoded.
Fig. 13. Map of symbolic codes: a) initial; b) reduced.
Fig. 14. CTG with reduced symbolic codes.
Fig. 15. CTG encoded.
Fig. 16. Logic circuit: excitation logic.
Fig. 17. Logic circuit: clock inhibition logic.
Fig. 18. Logic circuit: output logic.
5. DISCUSSION & RESULTS
Our tool has two options for automatic synthesis, which is
the conventional and clock reduction. The conventional
synthesis follows the traditional procedure [15]. We applied
our tool in ten examples from the literature (benchmarks).
Table 1 shows the data of the specification and of
conventional synthesis obtained by our tool. These data are:
processing time and excitation logic that are related to the
number of literals and the number of ports. Table 2 shows
the data obtained by the synthesis of clock reduction
including excitation logic and inhibition function h. The
67
column of bank of flip-flops shows the number of FFs that
will operate on the rising edge (+) and the in falling edge
() of the clock signal. Compared with the conventional
synthesis, our method achieved a 14% reduction in the
number of literals and a penalty of 5% in the number of
gates.
Table 1. Results of conventional synthesis
Examples   Inputs/Outputs  States  Literals  Gates  Proc. time (s)
Alarm          5/3            3       17        7        2
Complex        3/3           11      101       29      120
Dumbbell       2/2            8       23       12        2
Mark1          5/16          14      128       38        2
Pma            8/8           24      418      112      120
Seven          3/3            8       66       28        1
Six            1/1            8       51       21        2
Shiftreg       3/2            6       15        8        1
Three          2/1            3       15        8        1
Tma            7/6           20      267       80      120


Table 2. Results of the synthesis with reduced clock activity (R_A_clock)
Examples   Inputs/Outputs  States  Literals  Gates  Proc. time (s)  Bank of flip-flops
Alarm          5/3            3       28        8        3            1+ / 1-
Complex        3/3           11       69       35      127            2+ / 2-
Dumbbell       2/2            8       23       11        4            2+ / 1-
Mark1          5/16          14      130       49        3            2+ / 3-
Pma            8/8           24      349      107       28            4+ / 4-
Seven          3/3            8       70       33        6            2+ / 2-
Six            1/1            8       43       20        1            1+ / 3-
Shiftreg       3/2            6       17       10        2            2+ / 1-
Three          2/1            3       24       12        3            1+ / 1-
Tma            7/6           20      218       74      185            2+ / 4-

6. CONCLUSION
This article presented the Syntool_DET tool, which
automatically synthesizes synchronous controllers with
clock inhibition that operate on both edges (rising/falling)
of the clock signal. The controllers use only
FFs that are sensitive to a single edge of the clock signal.
This feature allows a large reduction in the activity of the
clock and, therefore, a reduction in power consumption. The
Syntool_DET tool was implemented in C++ with about 10,000
lines of code. For future work we intend to test the tool
thoroughly, estimate the power consumption for a large set
of benchmarks and compare it with the conventional
synthesis [17].

7. REFERENCES
[1] Li-Chuan Weng, X. J. Wang, and Bin Liu, A Survey of
Dynamic Power Optimization Techniques, Proc. Of the 3rd
IEEE Int. Workshop on System-on-Chip for Real-Time
Applications, pp. 48-52, 2003.
[2] J. J. Rodriguez, et al., "Features, Design Tools, and
Application Domains of FPGAs," IEEE Trans. on
Industrial Electronics, vol. 54, no. 4, pp. 1810-1823, August
2007.
[3] F. Najm, A Survey of Power Estimation Techniques in
VLSI Circuits, IEEE Trans. On VLSI Systems, vol. 2, no.
4, pp.446-455, December 1994.
[4] L. Jozwiak, et al., Multi-objective Optimal Controller
Synthesis for Heterogeneous embedded Systems, Int. Conf.
on Embedded Computer Systems: Architectures, Modeling
and Simulation, pp. 177-184, 2006.
[5] Luca Benini and G. De Micheli, Automatic Synthesis of
Low-Power Gated-Clock Finite-State Machines, IEEE
Trans. on CAD of Integrated Circuits and Systems, Vol.15,
No.6, pp.630-643, June 1996.
[6] Q. Wu, M. Pedram, and X. Wu, Clock-Gating and Its
Application to Low Power Design of Sequential Circuits,
IEEE Trans. On Circuits and Systems-I: Fundamental
Theory and Applications, vol. 47, no.103, pp.415-420,
March 2001.
[7] G. M. Strollo et al., Power Dissipation in One-Latch and
Two-Latch Double Edge Triggered Flip-Flops, Proc. 6th
IEEE Int. Conf. on Electronic, Circuits and Systems,
pp.1419-1422, 1999.
[8] S. H. Rasouli, A. Kahademzadeh and et al. Low-power
single- and double-edge-triggered flip-flops for high-speed
applications, IEE Proc. Circuits Devices Syst., vol. 152, no.
2, pp.118-122, April 2005.
[9] P. Zhao, J. McNeely, et al., Low-Power Clock Branch
Sharing Double-Edge Triggered Flip-Flops, IEEE Trans.
On VLSI Systems, vol. 15, no.3, pp.338-345, March 2007.
[10] B. Liu, et al., FSM Decomposition for Power Gating
Design Automation in Sequential Circuits, 76th Int. Conf.
on ASIC, ASICON, pp.944-947, 2005.
[11] S. Chattopadhyay, et al. State Assignment and Selection of
Types and Polarities of Flip-Flops, for Finite State Machine
Synthesis, IEEE India Conf. (INDICON), pp.27-30, 2004.
[12] J.-Mou Tseng and J.-Yang Jou, A Power-Driven Two-
Level Logic Optimizer, Proc. Of the ASP-DAC, pp.113-
116, 1997.
[13] Internet: www.altera.com, 2009.
[14] D. L. Oliveira, et al., Synthesis of Low-Power Synchronous
Controllers using FPGA Implementation, IEEE IV
Southern Conference on Programmable Logic, pp.221-224,
2008.
[15] R. H. Katz, Contemporary Logic Design, The Benjamin/
Cummings Publishing Company, Inc., 2nd edition, 2003.
[16] Jozias Del Rios, et al., "Automação do Projeto de Circuitos
Controladores Síncronos de Baixa Potência," Relatório
Técnico ITA Junior, 2008.
[17] J. H. Anderson and F. N. Najm, Power Estimation
Techniques for FPGAs, IEEE Trans. On VLSI Systems,
vol. 12, no. 10, pp.1015-1027, October, 2004.

MEMORY HIERARCHY TUNING FOR ENERGY CONSUMPTION REDUCTION
BASED ON PARTICLE SWARM OPTIMIZATION (PSO)
Cordeiro, F.R.; Caraciolo, M.P.; Ferreira, L.P. and Silva-Filho, A.G.
Informatics Center (CIn)
Federal University of Pernambuco (UFPE)
Av. Prof. Luiz Freire s/n - Cidade Universitária - Recife/PE - Brasil
email: { frc, mpc, lpf, agsf }@cin.ufpe.br
ABSTRACT
Tuning memory hierarchy parameters in embedded platform
applications can dramatically reduce energy consumption. This paper
presents an optimization mechanism for tuning the parameters of a
two-level memory hierarchy with separate instruction and data caches
at both levels. The proposed strategy uses Particle Swarm Optimization
(PSO) to provide decision support to the designer. The mechanism
aims to reduce energy consumption and improve the performance of
embedded applications. It finds a set of memory hierarchy
configurations (Pareto front) and supports the architecture designer
by providing a set of non-dominated solutions for decision making.
Results for 4 applications of the MiBench benchmark suite were
compared with another evolutionary technique, and better results
were observed in all analyzed cases.
1. INTRODUCTION
Integrated circuits for embedded systems are increasingly present
in several areas, such as robotics, automotive, home appliances and
portable devices. About 80% of the circuits developed are intended
for embedded system applications, whose demand keeps growing due to
the rapid development of mobile technologies and their increasing
efficiency.
The complexity of designing integrated circuit systems grows every
day, with transistor density doubling every 18 months according to
Moore's law. This increase in complexity implies aggregating more
and more functionality into smaller devices, associated with lower
cost, lower power consumption and better performance.
With the expansion and development of embedded system applications,
the market has been demanding fast and efficient solutions with
respect to parameters such as performance, area and the energy an
application may consume. The analysis of these parameters must be
done quickly in order to meet market demand.
Most integrated circuits developed for embedded applications contain
heterogeneous processors and, frequently, cache memories. It is
known that the energy consumption of memory hierarchies can
currently reach up to 50% of the energy consumed by a microprocessor
[1][2]. Therefore, by optimizing the memory architecture it is
possible to reduce the energy consumption of the processor and,
consequently, of the embedded system.
Many efforts have been made to reduce energy consumption by tuning
cache parameters according to the needs of a specific application.
However, since the fundamental purpose of the cache subsystem is to
provide high-performance memory access, cache optimization
techniques must not only save energy but also prevent degradation of
application performance.
Tuning cache memory parameters for a specific application can save
on average 60% of the energy consumption [3]. However, finding a
suitable cache configuration (combination of total size, line size
and associativity) for a specific application can be a complex task
and may require a long period of analysis and simulation. In
addition, the use of tools that collect data directly from the chip
can be slow, especially for running tests. Thus, the use of
simulators that analyze the components at a more abstract level has
become more viable to meet the speed and demand of the market.
Another observed problem is that exploring all configuration
possibilities may require a large amount of time, making an
exhaustive search for the best solution unfeasible. For memory
hierarchies with one cache level, varying total cache size, line
size and associativity, it is possible to obtain dozens of
configurations with specific characteristics [4]. In memory
hierarchies that involve a second cache level, where both levels are
split into instructions and data, hundreds of configurations are
possible [5]. Additionally, for memory hierarchies that involve a
unified second cache level, the possibilities involve thousands of
configurations that could be tested, due to the interdependence
between instructions and data [6].
In order to reduce the set of simulations needed to find a
configuration that is among the best possible ones, some search
mechanisms have been proposed in the literature.
The statistical analysis of these mechanisms is important to detect
in which situations each approach brings greater benefits in cache
memory optimization. Thus, when a new technique is proposed, a study
of the obtained performance and a comparative analysis with existing
techniques must be carried out.
In this work, an implementation of the particle swarm optimization
(PSO) algorithm is carried out to optimize the memory architecture,
together with a statistical analysis of its performance. The results
obtained are compared with the results of another technique proposed
in the literature, TEMGA, which is based on genetic algorithms,
analyzing the efficiency of PSO with respect to this exploration
mechanism [8].
2. RELATED WORK
Given the impact of the energy reduction obtained by tuning cache
parameters, many studies have been carried out to help the designer
choose these parameters. However, contributions to the tuning of
two-level cache memory hierarchies, with separate instruction and
data caches, have been fewer due to their higher complexity.
Gordon-Ross et al. [6] extended Zhang's heuristic, aimed at
one-level caches, and proposed the TCaT heuristic, aimed at
two-level hierarchies. The use of the TCaT heuristic allowed an
energy reduction of 53% when compared to Zhang's heuristic.
Silva-Filho et al., based on cache parameter tuning, proposed the
TECH-CYCLES [7] and TEMGA [8] heuristics, the latter based on
genetic algorithms. Using TECH-CYCLES it was possible to observe an
energy consumption reduction of 41% for the instruction cache, while
TEMGA obtained a 15% reduction for the data cache.
To our knowledge, no work has yet been carried out involving the use
of PSO to reduce the energy consumption and cycle count of an
application. Since PSO is an efficient algorithm for optimization
problems, it is interesting to observe its performance compared to
another technique proposed in the literature: TEMGA, which is based
on genetic algorithms, from the same area as PSO, both being
evolutionary search mechanisms.
3. PARTICLE SWARM OPTIMIZATION
In computational terms, the particle swarm optimization (PSO)
algorithm proposed by Kennedy [9] is a problem-solving technique
based on swarm intelligence, inspired by the social behavior of a
flock of birds. The PSO technique simulates the movement of birds
searching for an optimal solution in a search space for a given
problem. This concept is linked to the simplified model of swarm
theory, in which the birds (particles) make use of their own
experience and the experience of the flock itself to find the best
search region. In this scenario, the particles use specific
communication processes in order to reach an adequate common
solution, i.e., one of good quality for the problem.
The PSO algorithm may converge to sub-optimal solutions, but its use
ensures that no point of the search space has zero probability of
being examined. Every search and optimization task has several
components, among them: the search space, where all possible
solutions to a problem are considered, and the evaluation function
(reward and cost), a way of evaluating the members of the search
space.
Hence, the PSO algorithm operates on a swarm of particles, each
having a velocity vector and a position vector; the position of each
particle is updated according to its current velocity, the knowledge
acquired by the particle and the knowledge acquired by the flock.
Thus, the particles can search different areas of the solution
space, so that when a particle discovers a possibly better solution,
all the other particles will move closer to it, exploring this
search region more deeply during the process.
At each iteration t, the velocity of particle i is updated according
to the equation:

v_i(t+1) = w \, v_i(t) + c_1 r_1 \,(p_i(t) - x_i(t)) + c_2 r_2 \,(p_g(t) - x_i(t)),    (1)

where w is an inertia weight that controls the exploration
capability of the algorithm, c_1 and c_2 are two confidence
parameters that indicate how much a particle trusts itself and the
flock, respectively, and r_1 and r_2 are numbers generated randomly
and uniformly between 0 and 1. Updating the velocity in this way
allows particle i to move according to its individually best found
position p_i and the best position found by the whole swarm p_g.

Based on the velocity equation (1), the new position of the particle
is computed according to the equation:

x_i(t+1) = x_i(t) + v_i(t+1).    (2)

Here the new position is updated by combining the previous position
and the new velocity. Based on equations (1) and (2), the particle
swarm tends to cluster while, at the same time, each particle moves
randomly in several directions.
When the PSO algorithm is used to solve real problems, an initial
swarm of particles is randomly generated, where each particle
corresponds to a possible solution of the problem.
During the evolutionary process of the algorithm, the particles move
through the search space by updating the position and velocity
vectors, and are then evaluated by measuring their fitness. Fitness
in this context reflects how close the particle is to the region
containing the optimal solution.
Once the particles are evaluated, pbest and gbest are extracted,
i.e., the best position found by the particle and by the whole
swarm, respectively. After updating the velocities and positions of
each particle in the swarm, if the stopping criterion has been
reached, the solution found for the problem is presented. Otherwise,
the fitness evaluation is applied again to the swarm, the pbest and
gbest values are updated if a better solution is found, followed by
the velocity and position of each particle of the swarm. The loop
continues until the stopping criterion has been reached.
The flowchart in Figure 1 represents an outline of the algorithm
described above [10].

Fig. 1. Flowchart of the PSO algorithm.
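The sketch below is a generic, continuous-space version of the update loop given by equations (1) and (2) and the flowchart of Figure 1. It is our own illustration with assumed parameter values and a toy fitness function; TEMPSO, presented in the next section, adapts these updates to a discrete search space.

# Canonical PSO loop following equations (1) and (2); parameter values
# and the fitness function are illustrative assumptions.
import random

def pso(fitness, dim=2, n_particles=10, iters=50, w=0.7, c1=1.5, c2=1.5):
    x = [[random.uniform(-10, 10) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    pbest = [xi[:] for xi in x]
    gbest = min(pbest, key=fitness)
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # equation (1): inertia + cognitive + social terms
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (pbest[i][d] - x[i][d])
                           + c2 * r2 * (gbest[d] - x[i][d]))
                # equation (2): move the particle
                x[i][d] += v[i][d]
            if fitness(x[i]) < fitness(pbest[i]):
                pbest[i] = x[i][:]
        gbest = min(pbest, key=fitness)
    return gbest

# Toy usage: minimize the sphere function.
print(pso(lambda p: sum(c * c for c in p)))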

4. PROPOSED OPTIMIZATION WITH PSO
The model proposed in this work is a variation of the particle swarm
optimization algorithm for cache memory hierarchy optimization,
named TEMPSO (Two-Level Exploration Mechanism based on Particle
Swarm Optimization). For the cache memory optimization process,
TEMPSO maps a possible cache architecture to a particle, where each
particle is a possible solution in the search space.
The first parameter to be defined is the search space in which the
best solution for the problem is sought. In the target scenario of
this problem, the search space is defined by discretized attributes,
as presented in Section 3. The particle velocity and position update
operators must be adapted to model the possible solutions to the
problem in a valid way. For this purpose, a domain of possible
values for velocities and positions was defined for each particle.
From the pre-defined search space, the possible values for particle
velocity and position were generated as presented in Tables 2 and 3.
It is worth noting that, for the velocities, such values were
defined so that only valid positions representing possible
architectures would be generated.

Table 2. Possible particle velocities.
Parameter       Values
Velocities      {/4, /2, *1, *2, *4}

Table 3. Particle position space.
Parameter        Level-1 cache        Level-2 cache
Cache size       2KB, 4KB and 8KB     15KB, 32KB and 64KB
Line size        8B, 16B and 32B      8B, 16B and 32B
Associativity    1, 2 and 4           1, 2 and 4
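The sketch below illustrates how one of the discrete velocity operators of Table 2 can be applied to a single cache parameter while keeping the result inside the valid value set of Table 3. It is our own illustration: the paper only states that the operators were chosen so that valid positions are generated, so snapping to the nearest valid value is an assumption.

# Illustration of a discrete position update x = x (op) v, where v is one
# of the operators of Table 2 and the result must stay inside the value
# set of Table 3.
CACHE_SIZES_L1 = [2048, 4096, 8192]                              # 2KB, 4KB, 8KB
VELOCITIES = [('/', 4), ('/', 2), ('*', 1), ('*', 2), ('*', 4)]  # Table 2

def apply_velocity(position, velocity, domain):
    op, k = velocity
    value = position * k if op == '*' else position / k
    # keep the particle at a valid position of the discrete search space
    return min(domain, key=lambda d: abs(d - value))

print(apply_velocity(4096, ('*', 2), CACHE_SIZES_L1))  # -> 8192
print(apply_velocity(2048, ('/', 4), CACHE_SIZES_L1))  # -> 2048 (clamped)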

At the beginning of the execution of the TEMPSO algorithm, an
initial set of particles representing possible solutions to the
proposed problem is defined. Initially, the particle with the lowest
cost function is assumed to be the best architecture. At each
iteration, all architectures are evaluated based on a cost function
computed from the number of cycles and the energy consumed by the
application. Once the particles are evaluated, pbest and gbest are
extracted, i.e., the best position found by the particle and by the
whole swarm, respectively. The velocities and positions of the
particles are then updated. Since the search space is restricted and
discrete, variations of the velocity and position equations were
proposed in order to obtain valid configurations as the particles
move through the search regions. The equations are presented below.
The velocity equation:

v_id = C1*v_id (+) C2*(p_id - x_id) (+) C3*(p_gd - x_id) (+) C4*vrnd()    (3)
where (+) represents a logical OR (selection) operator. That is,
weights are defined, represented by the confidence parameters C1,
C2, C3 and C4, which influence the probability that the new velocity
of the particle is the same as the previous one it was moving with,
or the velocity influenced by the best position found by the
particle itself, or the velocity based on the best position found by
the whole swarm, or, finally, a random velocity. This choice is made
by randomly selecting one of the possible velocities, which are
weighted by the weights defined by the confidence parameters. The
velocity selection mechanism is inspired by one of the mechanisms
for selecting the best individual in the evolutionary process of
genetic algorithms: the roulette wheel [11]. Figure 2 illustrates an
example of how the roulette-wheel selection process adapted for
TEMPSO would work.

Fig. 2. Velocity update mechanism.

The position update equation is defined by the formula:

x_id = x_id (x) v_id    (4)

where (x) represents the operator applied by the possible velocity
parameters, which are the arithmetic operators of multiplication (*)
and division (/). Based on equations (3) and (4), the particles move
through the search regions looking for a good common solution and,
most importantly, remain at valid positions according to the search
space defined for the problem.
After updating the velocities and positions of each particle, if the
stopping criterion is satisfied, the best architecture found for the
problem is presented. Otherwise, the cost evaluation is applied
again to the swarm, the pbest and gbest values are updated if a
better solution is found, followed by the velocity and position of
each particle of the swarm. The loop continues until the stopping
criterion has been reached.
An analysis varying the number of particles in the swarm is carried
out in the next section in order to validate the convergence of the
swarm towards good solutions.

5. EXPERIMENTAL ENVIRONMENT
The approach proposed in this work involves the use of the PSO
algorithm to optimize a two-level cache memory architecture, with
separate instruction and data caches. This type of memory
architecture is widely used in the market and can be configured by
tuning six parameters: cache size, line size and associativity, for
each of the two levels. Figure 3 represents the described
architecture, composed of a MIPS processor, a first level with
independent instruction (IC1) and data (DM1) caches, a second level
with instruction (IC2) and data (DM2) caches, and a main memory
(MEM). A voltage of 1.7 V is used, with a write-through write policy
and 0.08 um 6-T transistor technology.

Fig. 3. Two-level memory hierarchy, with separate instruction and data caches.

Cache configurations common in commercial applications were adopted,
where the parameters vary according to Table 1. By tuning the
parameters of this cache hierarchy, a set of 458 combinations is
obtained, which constitutes the exploration space of all possible
configurations.

Table 1. Configuration exploration space.
Parameter        Level-1 cache        Level-2 cache
Cache size       2KB, 4KB and 8KB     15KB, 32KB and 64KB
Line size        8B, 16B and 32B      8B, 16B and 32B
Associativity    1, 2 and 4           1, 2 and 4

The PSO-based algorithm proposed in this work is responsible for
finding the best configuration of the cache parameters, by tuning
each parameter, in order to reduce the cost of searching through the
different configurations.
A set of four applications from the MiBench benchmark suite [12] was
used to run the simulations. The applications used are
Bitcount_small, Dijkstra_small, Susan_small and Patricia_small.
MiBench is a free, commercially representative benchmark suite for
embedded system applications.
The energy consumption and time needed to execute each application
for a given configuration were obtained using the SimpleScalar [13]
and eCACTI [14] tools.

6. RESULTS
For the experiments used in the simulations below, the number of
particles was set to vary among 5, 10, 20 and 40. The velocity
weights (Ci) were set to {2, 2, 2, 0}. The particles were randomly
initialized over the search space and the stopping criterion adopted
was the number of iterations.
The fitness function adopted to evaluate the particles was the cost
function given by the product of the energy consumption and the
number of cycles for each possible particle. The objective of the
problem is to minimize this cost function, that is, to reach the
Pareto optimum.
The analysis of the proposed model was divided into two stages:
performance analysis of TEMPSO and a comparative analysis with
TEMGA. For the initial performance analysis of TEMPSO, the
Bitcount_small application from the MiBench suite was used. In this
stage, a study of the algorithm's behavior was carried out, varying
the number of particles among 5, 10, 20 and 40. For each TEMPSO
configuration analysis, 30 simulations were run in order to obtain
the average of the results.
Table 4 presents the mean and standard deviation of the Energy
(Joules) x Cycles values of the best configurations found in the
simulations with the different numbers of particles. It can be
observed that the larger the number of particles, the better the
solutions presented.

Table 4. Results of varying the number of particles.
TEMPSO config.      Joules x Cycles (T = 10)      Joules x Cycles (T = 5)
(no. of particles)  Mean         Std. dev.        Mean         Std. dev.
5                   1521.724     298.96           1612.922     427.0936
10                  1367.032     163.28           1380.402     160.4924
20                  1305.7163    28.3961          1328.27      38.1659
40                  1296.02887   15.367           1300.06      20.9466

Another study was carried out by observing the convergence speed of
the algorithm for each variation of the number of particles.
Convergence was observed with respect to the fitness function, which
in the proposed algorithm is defined as the product of energy
consumed and number of cycles. The graph representing this analysis
is shown in Figure 4.
According to Figure 4, the variation with 40 particles converges
faster. This happens because there is a larger number of particles
searching for the best solution simultaneously. This promotes
greater communication of the best solution throughout the swarm,
increasing its convergence and exploration capability. The
20-particle variation obtained results close to those of 40
particles, showing that the largest effect lies between the
variation from 5 to 10 particles.
Additionally, a study was carried out regarding the number of
simulations needed per iteration when different numbers of particles
are used in the TEMPSO configuration. The number of simulations
needed when using 40 particles grows considerably, becoming
unfeasible in some cases. However, when using 5 or 10 particles the
number of simulations is acceptable.

Fig. 4. Convergence evolution analysis of the 4 variations.

When the configurations found by the TEMPSO algorithm with 5
particles are compared with the whole space of existing
configurations, good results can be observed. Figure 5 shows the
configurations obtained with respect to the search space produced by
the exhaustive method, for the Dijkstra_small application.

Fig. 5. Configurations found with respect to the search space.

As can be observed in Figure 5, the configurations found are among
those with the lowest energy consumption values, and some of them
are also among those with the lowest number of cycles. This
demonstrates the efficiency of the proposed mechanism in finding the
best configurations of the exploration space.
In the second analysis stage, a comparative study was carried out
between TEMPSO and TEMGA, which is an algorithm based on genetic
algorithms with good performance in the area of memory architecture
exploration [8]. As observed in the analysis of the previous
section, the configuration of the proposed algorithm that obtained
the best results was the one with 40 particles, but the number of
simulations needed to obtain these results becomes unfeasible for
embedded system applications. However, it was also possible to
observe that with the 5-particle configuration the number of
required simulations drops considerably, being compatible with other
techniques used in the area. Therefore, to carry out the comparison
with TEMGA, TEMPSO with 5 particles was used. A comparative study of
the best energy and cycle values found by each technique was carried
out. Table 5 illustrates this comparison, where the AP column with
values 1, 2, 3 and 4 represents the applications Bitcount_small,
Dijkstra_small, Patricia_small and Susan_small, respectively.

Table 5. Comparative analysis of the cost function (energy x cycles) between TEMPSO and TEMGA.
AP  TEMPSO      TEMPSO      TEMGA       TEMGA       Optimum     Optimum
    (Joules)    (Cycles)    (Joules)    (Cycles)    (Joules)    (Cycles)
1   2.925E-4    5.1664E+6   3.04E-4     5.166E+6    2.502E-4    5.163E+6
2   74.79E-4    22.625E+6   75.17E-4    22.58E+6    42.699E-4   21.287E+6
3   147E-4      43.973E+6   151.3E-4    43.17E+6    127.5E-4    42.844E+6
4   6.690E-4    5.2406E+6   6.82E-4     5.196E+6    6.367E-4    5.1745E+6

As shown in Table 5, the best configuration obtained by TEMPSO
reached lower energy values than TEMGA in all simulated
applications. Regarding the number of cycles, the results were
equivalent. Analyzing in terms of optimal values, the obtained
solutions achieved values close to the optimal solutions.
7. CONCLUSION
In this work a new model for cache memory architecture optimization
was proposed, named TEMPSO, for which a statistical analysis of its
performance was carried out together with a comparative study
against another technique in the area, TEMGA.
Four applications from the MiBench suite were used to validate the
performance of TEMPSO, considering a two-level cache memory
architecture with separate data and instruction caches. In the
analysis performed, it was possible to observe that the proposed
algorithm obtained better results in terms of energy consumed in all
observed applications when compared with TEMGA. Regarding the number
of simulations, it was also possible to observe better performance
of the proposed algorithm, with TEMPSO converging as fast as or
faster than TEMGA and requiring fewer simulations to reach the best
configuration found by the algorithm.
In future work we intend to perform optimizations of TEMPSO, as well
as to extend the analyzed applications, including applications from
other benchmarks.
8. REFERENCES
[1] M. Verma, and P. Marwedel, Advanced Memory
Optimization Techniques for Low-Power Embedded
Processors, Springer, Netherlands, 2007.
[2] M. Kandemir and A. Choudhary. Compiler-Directed Scratch
Pad Memory Hierarchy Design and Management, In
Proceedings of Design Automation Conference (DAC02),
New Orleans, USA, Jun. 2002.
[3] C. Zhang, F. Vahid, Cache configuration exploration on
prototyping platforms. 14
th
IEEE Interational Workshop on
Rapid System Prototyping (June 2003), vol 00, p.164.
[4] TRIOLI, M. F. (2004). Introdução à Estatística. São Paulo:
LTC.
[5] Gordon-Ross, Ann, Vahid, F., Dutt, Nikil, Automatic
Tuning of Two-Level Caches to Embedded Aplications,
DATE, pp.208-213 (Feb 2004).
[6] Gordon-Ross, Ann, Vahid, F., Dutt, Nikil, Fast
Configurable-Cache Tuning with a Unified Second-Level
Cache, ISLPED05, (Aug 2005).
[7] A.G. Silva-Filho, F.R. Cordeiro, R.E. SantAnna, and M.E.
Lima, Heuristic for Two-Level Cache Hierarchy
Exploration Considering Energy Consumption and
Performance, In: (PATMOS06), pp. 75-83, Sep 2006.
[8] A.G. Silva-Filho, C.J.A. Bastos-Filho, R.M.F. Lima, D.M.A
Falco, F.R. Cordeiro and M.P. Lima. An Intelligent
Mechanism to Explore a Two-Level Cache Hierarchy
Considering Energy Consumption and Time Performance,
SBAC-PAD, pp 177-184, 2007.
[9] J. Kennedy and R. C. Eberhart, "Particle swarm
optimization," in Proc. of the IEEE Int. Conf. on Neural
Networks. Piscataway, NJ: IEEE Service Center, 1995, pp.
1942-1948.
[10] D. Bratton and J. Kennedy, "Defining a standard for particle
swarm optimization," in Swarm Intelligence Symposium,
2007. SIS 2007. IEEE, Honolulu, HI, Apr. 2007, pp.
120-127.
[11] Mitchell, M.; An Introduction to Genetic Algorithms, MIT
Press, 1998.
[12] Guttaus, M. R.; Ringenberg, J.S.; Ernst, D.; Austin, T.M.;
Mudge, T.; Brown, R.B.; Mibench: A free, commercially
representative embedded benchmark suite. In IEEE 4th
Annual Workshop on Workload Characterization, pp.1-12,
December 2001.
[13] Dutt, Nikil; Mamidipaka, Mahesh; eCACTI: An Enhanced
Power Estimation Model for On-chip Caches, TR04-28;
set. 2004.
[14] Burger, D.; Austin, T.M.; The SimpleScalar Tool Set,
Version 2.0; Computer Architecture News; Vol 25(3),
pp.13-25; June 1997.
IP-CORE OF A RECONFIGURABLE CACHE MEMORY
Gazineu, G.M.; Silva-Filho, A.G.; Prado, R.G.; Carvalho, G.R.; Araújo, A.H.C.B.S. and Lima, M.E.
Centro de Informática (CIn)
Universidade Federal de Pernambuco (UFPE)
Av. Prof. Luiz Freire s/n - Cidade Universitária - Recife/PE - Brasil
email: { gmg2, agsf, grc, ahcbsa, mel}@cin.ufpe.br
ABSTRACT
This work aims at the development of an IP-core of a reconfigurable
cache memory. The architecture was developed in such a way that it
can be extended and connected to soft-core processors. In this work,
the memory architecture was partitioned so that only the cache size
can be reconfigured, based on the number of memory lines. The
IP-core was validated through simulation, and an analysis of the
memory architecture in terms of area was carried out in order to
provide an early estimate of the FPGA device as a function of the
cache size. An equation was obtained as a result of this relation
and validated based on two case studies.
1. INTRODUCTION
Reconfigurable computing aims to narrow the gap between the hardware
and software paradigms, giving computer designers a new development
perspective [1]. Although reconfigurable computing is fairly recent
and its concepts are not yet consolidated, this technology has
allowed implementation at the hardware level, keeping the high
performance of such implementations, and now also with a flexibility
that did not exist before.
Several FPGA technologies are available on the world market;
however, few have enough technology to provide partial
reconfigurability. Among the major manufacturers (Xilinx and Altera)
[4][5], Xilinx FPGAs allow part of the reconfigurable logic to be
reconfigured, thus allowing applications such as a reconfigurable
cache to be implemented.
Tuning a memory hierarchy at run time can be very useful,
considering that not all applications are ideal for a given memory
architecture and that commercial ASIC-based solutions cannot be
reconfigured at run time. On the other hand, reconfigurable caches
allow memory architectures to be tuned to meet design constraints
such as performance, area and energy consumption.
The implemented cache memory model was based on studies of existing
cache architectures [2]. The goal of this work was not to develop
new mapping schemes or cache replacement algorithms; the idea was to
code an existing cache model using the VHDL description language in
order to provide an architecture for a reconfigurable cache memory.
This article is an initial work aimed at obtaining a real RTL-level
model that can be connected to a soft-core processor. Additionally,
an evaluation in terms of FPGA occupation as a function of the
device is also carried out.
2. CACHE DESCRIPTION AND MODELING
A fully parameterizable cache is a rather complex design and can
require a long implementation time. We initially focused this work
on evaluating the cache size and left the parameterization of other
parameters, such as line size and cache associativity, for later
work. As a basis for our analyses, a basic cache memory structure
was implemented according to Figure 1.
At this stage of our studies we are concerned with analyzing the
effects of varying the number of cache memory lines in terms of the
area occupied in a reconfigurable device (FPGA), for several devices
of the Xilinx Virtex II family. Among the advantages of choosing
this FPGA family are: prices that are still competitive in the
market when compared with the Virtex 6 family, and the partial
reconfigurability supported by Xilinx FPGAs, which is essential for
designs involving reconfigurable caches. Additionally, it is
important to make clear that the work developed can be applied to
more recent FPGA families. Finding the right FPGA for a given SoC is
often attractive, considering that costs are high and differ among
the devices of a same family.
The work of Mamidipaka and Dutt [3] describes in detail the main
components of a cache architecture. That work was fundamental for
the understanding of the design, to clarify some ideas and to adjust
the final description of the work before moving on to the
implementation.



Fig. 1. Developed cache memory architecture.


The final design of the cache architecture was defined as follows
(Figure 1): a decoder module, a bit comparator, two temporary
storage buffers (one buffer to store data and another to store an
address), a storage bank to hold data and instructions, another bank
to store line-identifying tags, and the brain of the cache, the
controller module.
The decoder is responsible for receiving the address coming from the
processor and defining which cache line will be addressed in the
cache write/read operation. When the processor wants to execute an
operation on the cache, it sends an address that references a single
line in the cache and also sends control signals to the cache
controller.
The comparator is a module that compares two words (bit strings) bit
by bit to verify whether their values are equal. Inside the cache,
the comparator has the role of signaling to the controller whether
the instruction (or data) requested by the processor is in the cache
or not.
The buffer plays an important role in the cache architecture. It was
inserted in the cache architecture to try to relieve the traffic on
the system bus caused by cache requests. The number of buffers
varies depending on the implementation and on the cache update
algorithm (we applied the write-back algorithm). The schematic
drawing of the cache in Figure 1 contains only two buffers, one to
store data and another for the address.
When the processor performs a read on the cache, two operations are executed in parallel so that the read result is returned more quickly. One of the operations selects the cache line (the information) that will be returned, and the other finds out (by comparing tags) whether this selected line is the one expected by the processor.
This approach is very simple to understand: the selected data is stored in the buffer regardless of the result of the comparison (in the comparator). If there is a cache hit, that is, the selected data is the one expected by the processor, the controller releases the information stored in the data buffer; if the result of the comparison is negative (cache miss), the controller simply discards the information stored in the buffer and proceeds to update the cache line.
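A hedged sketch of this parallel read path is shown below. The bank sizes, signal names and the registered outputs are assumptions made for illustration only; this is not the implemented IP-core.

  library ieee;
  use ieee.std_logic_1164.all;

  entity read_path is
    generic (LINES : natural := 128; TAG_BITS : natural := 23);
    port (clk      : in  std_logic;
          index    : in  natural range 0 to LINES-1;
          req_tag  : in  std_logic_vector(TAG_BITS-1 downto 0);
          data_buf : out std_logic_vector(31 downto 0);  -- data buffer contents
          hit      : out std_logic);                     -- hit/miss flag to the controller
  end entity;

  architecture rtl of read_path is
    type data_bank_t is array (0 to LINES-1) of std_logic_vector(31 downto 0);
    type tag_bank_t  is array (0 to LINES-1) of std_logic_vector(TAG_BITS-1 downto 0);
    signal data_bank : data_bank_t := (others => (others => '0'));
    signal tag_bank  : tag_bank_t  := (others => (others => '0'));
  begin
    process (clk)
    begin
      if rising_edge(clk) then
        -- both actions happen in the same cycle: the word is buffered while
        -- the tag comparison decides whether it was a hit or a miss
        data_buf <= data_bank(index);
        if tag_bank(index) = req_tag then hit <= '1'; else hit <= '0'; end if;
      end if;
    end process;
  end architecture;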
In the coded architecture two storage banks were used. One of the arrays was implemented to hold the data and instructions needed for processor execution, and the other to contain only the line-identifying tags.
One of the most important components of the cache architecture is undoubtedly the controller: it manages all the communication among the other modules, receiving signals, processing them and deciding what to do. At each new machine cycle, the cache controller adopts a different behavior, directing the flow inside the cache architecture through control signals.
3. CACHE STATES
The cache controller developed uses a single state machine to model the four possible information flows inside the cache: a read hit, a read miss, a write hit and a write miss. Figure 2 shows the diagram of the state machine used.
The reset control signal initializes all components of the system. When it is set to logic high (one), the controller goes to the initial state and the memories (data and tags) and the buffers (data and address) are cleared; when it is at logic low (zero) the system remains unchanged.



Fig. 2. State machine diagram of the controller.
In the first stage of the state machine, the control variables and signals are initialized and the controller waits for the control signal that enables the operation. This signal indicates whether the processor wants to execute a write or a read on the cache. If the control signal rws is at logic high, the controller executes a write operation; otherwise (logic low) it performs a read operation.
Figure 3 is a minimized diagram of the state machine for the read operation. In the initial state the controller decides whether it will execute a write or a read, based on a control signal called rws (read-write signal).



Fig. 3. State machine of the read operation.
The controller's next step is to find out whether the desired information is (or is not) in the cache. The compare state of the read diagram is responsible for the tag comparison; at this stage the controller waits for the result from the comparator module. If the data is in the cache (hit), the state machine jumps to the next step, which simply releases the data to the processor.
If there is a miss on the cache read, the controller needs to update the cache line with a new block from main memory before releasing the information to the processor. In the cache update state (see Figure 3), the controller sends the processor's address to main memory, requesting the new block to be swapped into the cache. In this state the controller waits for main memory to supply the new block and to signal (ready_mem = 1) that the new data is available for the update; as long as this does not happen, the controller remains in the same state until the new input is received.
The last state of the controller is the return of the information requested by the processor. At this stage the controller simply releases the information stored in the data buffer to the processor and signals the end of the operation (ready_proc = 1).
Fig. 4. State machine of the write operation.

The state diagram of the write operation has three states in addition to the initial state described earlier. These states are: cache update, memory update and finish (see Figure 4).
In the cache update state, the controller overwrites the line referenced by the address (decoded in the initial state) with the data sent by the processor. When the whole cache write operation is finished, the controller is in charge of updating main memory, which is the next state (memory update) after the cache update.
In the main memory update state, the controller sends the address to the memory (the address previously stored in the buffer) and sends the control signals that enable main memory and signal a write. In this state the controller waits for a response from main memory signaling that the write was performed successfully (ready_mem = 1); if this does not happen, the controller remains in this state during the next cycle.
The last state is the finish state of the write operation. In this state the controller clears all the internal signals of the cache, signals the end of the write operation to the processor (ready_proc = 1) and returns to the initial state of the state machine.
Whenever a data item is written to the cache, it is promptly updated in main memory as well.
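To make the flow of this section concrete, the sketch below outlines a controller state machine with the states just described. It is an illustration under our own naming assumptions (only rws, ready_mem and ready_proc come from the text); it is not the authors' controller.

  library ieee;
  use ieee.std_logic_1164.all;

  entity cache_ctrl is
    port (clk, reset : in  std_logic;
          start, rws : in  std_logic;   -- rws = '1' write, '0' read
          hit        : in  std_logic;   -- from the tag comparator
          ready_mem  : in  std_logic;   -- main-memory handshake
          ready_proc : out std_logic);  -- end-of-operation flag to the processor
  end entity;

  architecture rtl of cache_ctrl is
    type state_t is (s_init, s_compare, s_refill, s_return,
                     s_write_cache, s_write_mem, s_finish);
    signal state : state_t := s_init;
  begin
    process (clk)
    begin
      if rising_edge(clk) then
        if reset = '1' then
          state <= s_init; ready_proc <= '0';
        else
          ready_proc <= '0';
          case state is
            when s_init =>                           -- wait for the operation
              if start = '1' then
                if rws = '1' then state <= s_write_cache;
                else state <= s_compare; end if;
              end if;
            when s_compare =>                        -- read: check the tags
              if hit = '1' then state <= s_return;
              else state <= s_refill; end if;
            when s_refill =>                         -- read miss: wait for memory
              if ready_mem = '1' then state <= s_return; end if;
            when s_return =>                         -- release buffered data
              ready_proc <= '1'; state <= s_init;
            when s_write_cache =>                    -- overwrite the cache line
              state <= s_write_mem;
            when s_write_mem =>                      -- write-through to main memory
              if ready_mem = '1' then state <= s_finish; end if;
            when s_finish =>                         -- signal end of the write
              ready_proc <= '1'; state <= s_init;
          end case;
        end if;
      end if;
    end process;
  end architecture;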
4. COMPONENTS INSTANTIATED ON THE FPGA
This work aims at developing a basic cache memory structure with a direct-mapped scheme, a write-through update policy and a word size fixed at 32 bits. The proposed approach divides the FPGA area into two parts: (i) one that is dynamically reconfigurable, composed of the Data and Tag arrays, and (ii) a fixed part composed of the controller, comparator, decoder and buffers.
The elements that form the reconfigurable set are the components that vary considerably in their physical structure. Figure 7 shows an overview of the cache implementation inside an FPGA, indicating the fixed and reconfigurable parts. On the left are the data and tag arrays, which grow in size as the number of cache lines varies; they are the ones that enlarge the region occupied inside the FPGA. On the right side of the figure are the components whose implementation is fixed. Among the components of the fixed part there are, in fact, two components that undergo small changes in the code: the decoder and the comparator.
The decoder always receives the same 32-bit address, but the number of bits (module outputs) assigned to addressing the line and to the tag (for comparison) differs from implementation to implementation. The comparator module, in turn, always has the same output, a single bit that signals the result of the comparison, but its inputs differ between designs, since the sizes of the words (tags) being compared depend on which cache organization is used. The variations undergone by these components are so small that they do not change the number of logic blocks that implement them in the FPGA.


Fig. 7. Elements of the cache architecture.
5. AREA AND DEVICE ANALYSIS
At this stage of the studies, the concern was to analyze the effects of varying the number of cache lines in terms of area occupied in a reconfigurable device (FPGA), for several devices of the Xilinx Virtex II family.
Five cache implementations were generated, varying only the storage arrays and keeping the other components fixed. After collecting the area variation inside the FPGA for each of the five cache sizes, a table was produced gathering all the relevant information. Table 1 shows the relation between cache size and the number of look-up tables (LUTs) needed to implement each one, taking into account the final cache architecture with its components, the routing matrices and the interfacing circuits. The table also shows the area occupied (as a percentage) by each of the cache implementations. It can be observed that the 128 kB and 256 kB caches exceeded 100% of the FPGA (overmapping) in some devices of the Virtex II family (entries above 100% in the table), so that these devices were not large enough to accommodate such implementations.

Table 1. Number of LUTs per cache size and area occupied in Virtex II family FPGAs.
CACHE    No. of LUTs   xc2v500  xc2v1000  xc2v1500  xc2v2000  xc2v3000  xc2v4000  xc2v6000
16 kB        741          12%       7%        5%        3%        2%        1%        1%
32 kB       1545          25%      15%       10%        7%        5%        3%        2%
64 kB       3084          50%      30%       21%       14%       10%        6%        4%
128 kB      6184         101%      60%       40%       28%       21%       13%        9%
256 kB     12404         201%     121%       81%       57%       43%       26%       18%


Os dispositivos xc2v4000 e xc2v6000 tm uma
quantidade de blocos lgicos muito grande, fazendo com
que todas as diferentes implementaes de cache
ocupassem menos de 30% destes dispositivos. Se o escopo
desta monografia estivesse analisando uma arquitetura com
outros componentes, por exemplo, microprocessador,
memria principal, displays, com certeza os melhores
(ideais) dispositivos da famlia Virtex II seriam os devices:
xc2v1500, xc2v2000 e xc2v3000.
From Table 1 it was possible to represent graphically the relation between cache size and the number of LUTs used in each implementation. The plot is simple because, for a given cache size, the number of LUTs needed to implement it is the same regardless of the device within the same FPGA family.
This plot is important for visualizing the points that will be used to find the equation that determines the ideal FPGA for a hardware design developed by a designer. Figure 8 shows, in real scale, the linear variation of the number of LUTs for cache sizes of 16, 32, 64, 128 and 256 kbytes (this plot was drawn using the values of the first two columns of Table 1).


Fig. 8. Cache size (kbytes) vs. number of LUTs.

Atrelado a este grfico outra informao necessria
para decidir sobre qual dispositivo FPGA deve ser usado
dado uma configurao de cache: a quantidade mxima de
LUTs permitido para cada dispositivo (Tabela 2). Atravs
desta quantidade de LUTs que determinado, com o
auxlio da Tabela 2, qual o dispositivo ideal para a
implementao.

Tabela 2. Quantidade de LUTs por dispositivo da
famlia VirtexII da Xilinx

Device LUTs
xc2v80 1024
xc2v250 3072
xc2v500 6144
xc2v1000 10240
xc2v1500 15360
xc2v2000 21504
xc2v3000 28672
xc2v4000 46080
xc2v6000 67584
xc2v8000 93184


In order to obtain an equation representing the FPGA area in terms of LUTs, a linear regression was performed over the points of Figure 8, resulting in the following linear equation:

QLUT = 48.59 * SIZE - 36.53     (1)

where SIZE is the cache size in kbytes and QLUT is the number of LUTs for a given cache size.
6. EXAMPLES
In a real implementation, other components must be considered, such as the processor and main memory. Therefore, to exemplify the proposed approach, consider a system composed of a Leon2 processor, cache memory and main memory.
In this approach, we consider that the FPGA is divided into two parts: a fixed (non-reconfigurable) part and a reconfigurable part, as illustrated in Figure 9. The fixed part is composed of the fixed components of the cache memory (controller, decoder, comparator and buffers), main memory, the processor and the ICAP module, totaling approximately 5000 LUTs. The reconfigurable part contains only the data and tag arrays of the cache, which vary according to its configuration.
We consider the partial reconfiguration method based on the ICAP (Internal Configuration Access Port), which can be instantiated and is available as an internal logic resource of the FPGA. The main advantage of using the ICAP for applications involving reconfigurable caches is the FPGA's self-reconfiguration capability, which allows, for example, the application itself to switch to a new cache configuration.
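For reference, a self-reconfiguration setup of this kind typically just instantiates the ICAP primitive and drives it from user logic. The sketch below is an illustration, not part of the proposed IP-core; it assumes the ICAP_VIRTEX2 primitive from the unisim library with an 8-bit data port, and its exact port polarities should be checked against the Xilinx libraries documentation.

  library ieee;
  use ieee.std_logic_1164.all;
  library unisim;
  use unisim.vcomponents.all;

  entity icap_wrapper is
    port (clk      : in  std_logic;
          cfg_data : in  std_logic_vector(7 downto 0);  -- partial-bitstream byte
          cfg_ce_n : in  std_logic;                     -- enable (assumed active low)
          cfg_wr_n : in  std_logic;                     -- write  (assumed active low)
          busy     : out std_logic;
          readback : out std_logic_vector(7 downto 0));
  end entity;

  architecture rtl of icap_wrapper is
  begin
    u_icap : ICAP_VIRTEX2
      port map (CLK => clk, CE => cfg_ce_n, WRITE => cfg_wr_n,
                I => cfg_data, O => readback, BUSY => busy);
  end architecture;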









Fig. 9. Elements of the cache architecture (fixed and reconfigurable parts).

Example 1: Consider a 64 kB cache. Which FPGA should be used in this design?

QLUT = 48.59 * 64 - 36.53 = 3073.23 LUTs

In this case, we obtained a total of 3073.23 LUTs. Adding the 5000 LUTs of the fixed part results in a total of 8073.23 LUTs. With the help of Table 2, we verify that the xc2v1000 device meets the requirements of this application. Devices with maximum LUT counts well above the chosen one would likely imply a relevant increase in the design cost.
Considering that few modifications are needed when a cache memory hierarchy is considered, this approach can be extended, as in the following example.

Example 2: Consider a cache hierarchy composed of a 32 kB instruction cache (IC) and a 128 kB data cache (DC). Which FPGA should be used in this design?

QLUT(IC) = 48.59 * 32 - 36.53 = 1518.35
QLUT(DC) = 48.59 * 128 - 36.53 = 6182.99
TOTAL = 5000 LUTs (fixed part) + 1518.35 + 6182.99 = 12701.34 LUTs

Device: xc2v1500
In summary, Equation (1) helps the designer, before the implementation stage, to decide which Virtex II FPGA device should be used in the development of a design containing a cache memory hierarchy and a processor, based on the cache size in kbytes.
7. CONCLUSION

A cache memory was implemented in VHDL and its operation was validated through simulation. A new equation was presented that allows the area occupied by a cache memory to be estimated in terms of LUTs. This work shows that the Xilinx Virtex II family can satisfactorily accommodate, in terms of available space, real cache systems. Considering a SoC composed of a processor and a memory hierarchy, and knowing the area occupied by the processor, it is possible, through the proposed approach, to choose the Virtex II family device suited to a given application. Other relations can be obtained considering other FPGA families that support partial reconfiguration.
8. REFERENCES
[1] MARTINS, C.A.P.S., ORDONEZ, E.D.M. and CARVALHO, M.B., Computação Reconfigurável: Conceitos, Tendências e Aplicações. Available at: <http://ftp.inf.pucpcaldas.br/CDs/SBC2003/pdf/arq0251.pdf>. Accessed: 25/11/2009.
[2] STALLINGS, W., Arquitetura e Organização de Computadores: Projeto para o Desempenho, 5th ed., translated by Carlos Camaro de Figueiredo and revised by Edson Toshimi Midorikawa. São Paulo: Prentice Hall, 2002.
[3] MAMIDIPAKA, M. and DUTT, N., eCACTI: An Enhanced Power Estimation Model for On-chip Caches. Available at: <http://www.cecs.uci.edu/technical_report/TR04-28.pdf>. Accessed: 15 February 2005.
[4] Xilinx: The Programmable Logic Company. Available at: <http://www.xilinx.com/ise/logic_design_prod/foundation.htm>. Accessed: 12 March 2005.
[5] Altera FPGA. Available at: <www.altera.com>.
A NOTE ON MODELING PULSED SEQUENTIAL CIRCUITS WITH VHDL
Alberto C. Mesquita Júnior*

Departamento de Eletrônica e Sistemas, Universidade Federal de Pernambuco
R. Acad. Hélio Ramos, s/n, 50740-530 Recife-PE, Brazil
e-mail: alberto.ufpe@gmail.com
ABSTRACT
This paper discusses how to use VHDL to describe pulsed sequential circuits. The difficulty of elaborating such descriptions is emphasized, since VHDL does not seem to be well provided with attributes or resources for detecting the occurrence of pulses. Examples are presented.
1. INTRODUCTION
In a level synchronous sequential circuit, the state changes on the rising or falling edge of the clock signal, and the next state depends on the logic levels present at the inputs and on the past logic levels of inputs and outputs. In a pulsed synchronous sequential circuit, the state changes only after the occurrence of a pulse (positive or negative) at one of the inputs, and the next state depends on the past combinations of such pulses. This paper is divided into three sections; this introduction is the first. Next, the hardware model is explained and examples are synthesized and discussed. Finally, the conclusions are presented.
2. THE HARDWARE MODEL
The hardware model of a pulsed synchronous sequential
circuit is shown in figure 1. A pulse signal is characterized
by the successive occurrence of two opposite transitions.
If a signal is at rest in the low logic level, a positive pulse
occurs after a rising transition followed by a falling
transition on this signal. The schematic and the VHDL
description of an S-C pulsed master-slave flip-flop are
shown, respectively, in figures 2 and 3 as an example of a
pulsed memory cell. Note that is forbidden simultaneous or
overriding pulses on S and C. Now, with the behavioral
VHDL description of this type of memory cell, one is able
to elaborate structural VHDL descriptions of general
pulsed sequential circuits. For the purpose of this paper,
only behavioral descriptions are developed and one has to
develop models for the next logic state for this kind of
sequential machines. The VHDL codes shown were
compiled and simulated using the Quartus II 9.0sp1 Web



































Edition with the standard options. For all models and
examples, three devices were used for the evaluation. I
mean, each VDHL description of the hardware models and
examples present in this paper was compiled and simulated
for three cases: the first case used the EPM7032SLC44-5
Max 7000S device; the second, the EP1S10F484C5 Stratix
and the third, the EP2S15F484C3 Stratix II.

Fig. 1. Pulsed synchronous sequential circuit model.
Fig. 2. Schematic of the S-C pulsed master-slave flip-flop.

* The author would like to express his gratitude to Dr. Edval J. P. Santos for reading this paper and making suggestions for improvements.
2.1. A simple model for the next-state logic
In Figure 4, a simple model for the next-state logic is presented. It is assumed that the timing restrictions are satisfied.
As an example, the description of a pulsed binary up/down counter is analyzed. This counter has two pulsed input lines, Xup and Xdw. When a pulse occurs at Xup, the counter's content is incremented, and when a pulse occurs at Xdw, it is decremented. Only the third case produced a good simulation result. The VHDL code and the simulation timing diagrams are shown in Figures 5, 6, 7 and 8, respectively.
2.2. A second model
This second model works in the cases where the first model of subsection 2.1 failed to work. The new hardware model is presented in Figure 9. The same example was simulated using this model. The new VHDL code is almost the same as the previous one, except for the three lines that begin at the line labeled next_state_logic. These lines were rewritten as:



next_state_logic:
dcount<=sumup when sel="10" else
sumdw when sel="01" else
null;




The new VHDL code was compiled and simulated considering the same three cases as above, and all simulations presented good results. The simulation timing diagrams are shown below only for the two cases that did not work in subsection 2.1 (see Figures 10 and 11).
-- S-C Pulsed Master-Slave Flip-Flop
ENTITY scpmsff IS
  PORT (S, C : IN BIT; Q, NQ : OUT BIT);
END scpmsff;

ARCHITECTURE behaviour OF scpmsff IS
  SIGNAL qm, qmb : BIT;
BEGIN
  masterslave: PROCESS (S, C)
  BEGIN
    IF S = '1' THEN qm <= '1'; qmb <= '0';
    ELSIF C = '1' THEN qm <= '0'; qmb <= '1';
    ELSE Q <= qm; NQ <= qmb;
    END IF;
  END PROCESS masterslave;
END behaviour;

Fig. 3. VHDL description of the S-C pulsed master-slave flip-flop.
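A small testbench sketch (ours, not from the paper) that exercises this flip-flop with non-overlapping pulses could be written as:

  entity tb_scpmsff is end entity;

  architecture sim of tb_scpmsff is
    signal S, C, Q, NQ : bit := '0';
  begin
    dut : entity work.scpmsff port map (S => S, C => C, Q => Q, NQ => NQ);

    stimulus : process
    begin
      wait for 20 ns;
      S <= '1'; wait for 10 ns; S <= '0';  -- set pulse: Q should go to '1'
      wait for 20 ns;
      C <= '1'; wait for 10 ns; C <= '0';  -- clear pulse: Q should return to '0'
      wait for 20 ns;
      wait;                                -- stop the stimulus process
    end process;
  end architecture;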
Fig. 4. A simple next-state hardware structure.
library ieee;
use ieee.std_logic_1164.all;

ENTITY dpulsedcounterupdw IS
  GENERIC (max : integer := 15);
  PORT (Xup, Xdw : in std_logic;
        counting : out integer range 0 to max);
END dpulsedcounterupdw;

ARCHITECTURE behavior OF dpulsedcounterupdw IS
  signal dcount, countsl : integer range 0 to max;
  signal sumup, sumdw    : integer range 0 to max;
  signal sel : std_logic_vector (1 to 2);
  signal ck  : std_logic;
BEGIN
  ck  <= Xup OR Xdw;
  sel <= Xup & Xdw;
  counting <= countsl;
  sumup <= 0   when countsl = max else countsl + 1;
  sumdw <= max when countsl = 0   else countsl - 1;

  next_state_logic:
  dcount <= sumup when sel = "10" else
            sumdw when sel = "01" else countsl;

  reg: PROCESS (ck)
  BEGIN
    IF falling_edge(ck) THEN countsl <= dcount; END IF;
  END PROCESS reg;
END behavior;
Fig. 5. Pulsed up/down counter VHDL code.
Fig. 6. Simulation result: pulsed up/down counter, timing diagram, 1st case: EPM7032SLC44-5 Max 7000S.
2.3. A third example
This third example is more general: a pulsed sequential machine that has three pulsed input lines, Reset, X1 and X2, and two Mealy output lines, Z1 and Z2. The output Z1 will be equal to X1 (Z1 = X1) whenever an odd number of pulses on X2 occurs between two consecutive pulses of X1. The output Z2 will be equal to X2 (Z2 = X2) whenever an even number of pulses on X1 occurs between two consecutive pulses of X2. Overlapping sequences are considered. When a pulse occurs on the Reset line, the initial state is granted.
The VHDL description is shown in Figure 12. The simulation results are presented in Figures 13, 14 and 15.
In the second case (EP1S10F484C5 Stratix), although Quartus II issued timing-violation messages such as "not operational: Clock Skew > Data Delay" and "Warning: Circuit may not operate. Detected 1 non-operational path(s) clocked by clock 'X1' with clock skew larger than data delay", the simulation presents a good result.
In the third case (EP2S15F484C3 Stratix II), the same messages are displayed by Quartus II and, as can be seen in Figure 15, a spurious pulse is produced.
An alternative way to avoid spurious pulses, and a better synthesis approach, is to rewrite the next-state logic using concurrent statements. The new VHDL code is equal to the previous one, except for the description of the next-state logic. This code produces a good synthesis, without the timing-violation messages, for all cases. Figure 16 presents this part of the new code and Figure 17 the simulation results for the third case (EP2S15F484C3 Stratix II).
3. CONCLUSION
The synthesis of pulsed sequential circuits and of classical clocked sequential circuits is similar, but VHDL and the synthesis tools do not provide resources to generate hardware with pulsed master-slave flip-flops. The designer has to create hardware artifices within the context of the available tools. In the models given in this paper, the input signals are used, at the same time, to compute the next state and to store it in the memory cells, so the correct operation of the synthesized circuits depends strongly on the propagation time of the next-state logic and on the setup and hold times of the memories.
Since Quartus II does not allow control of the timing characteristics of the devices, as shown in this paper, the only possibility is to carefully select the devices and the VHDL description style.





Fig. 7. Simulation result: pulsed up/down counter, timing diagram, 2nd case: EP1S10F484C5 Stratix.
Fig. 8. Simulation result: pulsed up/down counter, timing diagram, 3rd case: EP2S15F484C3 Stratix II.
Fig. 9. Next-state hardware structure with a latch.
Fig. 10. Simulation result: timing diagram, 1st case with a master latch: EPM7032SLC44-5 Max 7000S.
Fig. 11. Simulation result: timing diagram, 2nd case with a master latch: EP1S10F484C5 Stratix.
4. REFERENCES
[1] Fredrick J. Hill and Gerard R. Peterson, Introduction to
Switching Theory and Logical Design, Chapter 11: Pulse
Mode Circuits, Wiley 1974.
[2] Victor P. Nelson, H. Troy Nagle, Bill D. Carroll and J.
David Irwin, Digital Logic Circuit Analysis & Design,
Chapter 10: Asynchronous sequential Circuits, Prentice Hall
1995.
[3] Roberto d'Amore, VHDL: Descrição e Síntese de Circuitos Digitais, LTC, 2005.
[4] Volnei A. Pedroni, Circuit Design with VHDL, MIT Press
2004.
[5] Altera Publishing, Quartus II Introduction for VHDL
Users, PDF Tutorial Quartus II 9.0 web edition software
2007.

Library ieee; use ieee.std_logic_1164.all;
Entity dpulse_machinewcntZ1Z2 is
Generic (inp:integer:= 3;estX1:integer:=3;
estX2: integer:=2);
Port (X2,X1,Reset: in std_logic; Z2,Z1:out std_logic);
end entity dpulse_machinewcntZ1Z2;
Architecture func of dpulse_machinewcntZ1Z2 is
signal stX1m, stX1e: integer range 0 to estX1;
signal stX2m, stX2e: integer range 0 to estX2;
signal ck:std_logic;
Begin
ck<= X2 or X1 or Reset;
Z2<= X2 when stX1e=3 else '0'; -- counting X1
Z1<= X1 when stX2e=2 else '0'; -- counting X2
countingX1: Process (Reset, X2, X1)
Begin
If Reset='1' then stX1m<=0; elsif X2='1' then
stX1m<=1; elsif X1='1' then case stX1e is
when 0 => stX1m<=0; when 1 => stX1m<=2;
when 2 => stX1m<=3; when 3 => stX1m<=2;
end case; end if;
end process countingX1;
countingX2: Process (Reset,X2,X1)
Begin
If Reset='1' then stX2m<=0; elsif X1='1' then
stX2m<=1; elsif X2='1' then case stX2e is
when 0 => stX2m<=0; when 1 => stX2m<=2;
when 2 => stX2m<=1; end case; end if;
end process countingX2;
update: Process (ck) -- registers
Begin
If falling_edge(ck) then stX2e<=stX2m;
stX1e<=stX1m; end if; end process update; end func;
Fig. 12. VHDL description.
Fig. 13. Simulation result, timing diagram, 1st case: EPM7032SLC44-5 Max 7000S.
Fig. 14. Simulation result, timing diagram, 2nd case: EP1S10F484C5 Stratix.
Fig. 15. Simulation result, timing diagram, 3rd case: EP2S15F484C3 Stratix II.
-- next state logic
stX1m <= 0 when reset='1' else 1 when X2='1'else
0 when (X1='1' and stX1e=0) else
2 when (X1='1' and stX1e=1) else
3 when (X1='1' and stX1e=2) else
2 when (X1='1' and stX1e=3) else null;

stX2m <= 0 when reset='1' else 1 when X1='1'else
0 when (X2='1' and stX2e=0) else
2 when (X2='1' and stX2e=1) else
1 when (X2='1' and stX2e=2) else null;
Fig. 16. New code for the next-state logic.
Fig. 17. Timing diagram, 3rd case, next-state logic with concurrent statements: EP2S15F484C3 Stratix II.
COMPARATIVE STUDY BETWEEN THE IMPLEMENTATIONS OF DIGITAL
WAVEFORMS FREE OF THIRD HARMONIC ON FPGA AND MICROCONTROLLER
Diogo R. R. Freitas, Member IEEE, Edval J. P. Santos, Senior Member IEEE
Laboratório de Dispositivos e Nanoestruturas, Departamento de Eletrônica e Sistemas
Universidade Federal de Pernambuco
Av. Acadêmico Hélio Ramos, s/n, Cidade Universitária, Recife-PE, Brasil, 50.740-530
email: diogoroberto@ieee.org, edval@ee.ufpe.br
ABSTRACT
Third harmonic measurements are used to determine the linearity level of passives, such as resistors, capacitors, and inductors, as recommended by the IEC/TR 60440 standard. Signal generators with very low third harmonic content have to be developed for such an application. Although a high-purity analog sine wave generator is the natural option, it has been demonstrated that one can generate digital waveforms free of third harmonic. This paper presents a comparison between third-harmonic-free digital waveform generators built using an FPGA (Field Programmable Gate Array) and a microcontroller.
1. INTRODUCTION
During the fabrication process of passive components, such as resistors, capacitors, and inductors, it is required to assess the linearity of the fabricated component to determine whether it has passed the quality test, as recommended by IEC/TR 60440. The CLT10/CLT20 by Danbridge A/S is an instrument for linearity testing. This instrument generates a pure sine waveform of 10 kHz and measures the third harmonic level at 30 kHz [1].
In a linear resistor the relationship between voltage and current is constant and its value is equal to the resistance. For a nonlinear resistor the relationship between voltage and current is a nonlinear function i = f(v). The transfer function shown in Fig. 1 can be defined as presented in Equation (1) [2]:

V_O + v_o = f(V_I + v_i)     (1)

where V_O and V_I are the DC components and v_o and v_i the AC components of the output and input voltages. Expanding the output voltage in a power series, one obtains Equation (2):

V_O + v_o = c_0 + c_1 v_i + c_2 v_i^2 + c_3 v_i^3 + ...     (2)

Fig. 1. (a) Reference points for plotting the transfer
function of resistors. (b) Example of nonlinear transfer
function.
Using that V_O = c_0, this equation can be simplified as in Equation (3):

v_o = c_1 v_i + c_2 v_i^2 + c_3 v_i^3 + ...     (3)

The term c_1 in Equation (3) is the linear gain. If the components c_n for n >= 2 are not zero, the circuit generates harmonics at the output voltage. In real circuits these components are rarely zero.
Assuming that the input voltage is a cosine, v_i = V_1 cos(ωt), v_o is an even function of time. Therefore the sine coefficients of the Fourier series are all null.

v_o = c_1 (V_1 cos ωt) + c_2 (V_1 cos ωt)^2 + c_3 (V_1 cos ωt)^3 + ...     (4)


Fig. 2. Block diagram of the waveform generator circuit.
The cosine coefficients a_n of the Fourier series are related to the terms c_n of the power series as follows:

a_0 = c_2 V_1^2 / 2 + 3 c_4 V_1^4 / 8 + ...     (5)
a_1 = c_1 V_1 + 3 c_3 V_1^3 / 4 + ...           (6)
a_2 = c_2 V_1^2 / 2 + c_4 V_1^4 / 2 + ...       (7)
a_3 = c_3 V_1^3 / 4 + 5 c_5 V_1^5 / 16 + ...    (8)
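These relations follow from the standard cosine-power identities (a textbook step, not spelled out in the paper):

\[
\cos^{2}\theta=\tfrac{1}{2}(1+\cos 2\theta),\qquad
\cos^{3}\theta=\tfrac{1}{4}(3\cos\theta+\cos 3\theta),
\]
\[
\cos^{4}\theta=\tfrac{1}{8}(3+4\cos 2\theta+\cos 4\theta),\qquad
\cos^{5}\theta=\tfrac{1}{16}(10\cos\theta+5\cos 3\theta+\cos 5\theta).
\]

In particular, the c_3 term of Equation (4) contributes c_3 V_1^3 / 4 to a_3, which is why a nonzero c_3 shows up directly as third-harmonic content.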
Considering Fig. 1, if two reference points are marked at the terminals of a generic resistor, say point 1 and point 2 (see Fig. 1(a)), one can plot the transfer curve for this resistor. When current flows through the resistor from point 1 to point 2, the current is taken as positive and the terminal 1 voltage is higher than the terminal 2 voltage. When current flows in the opposite sense, the voltage reverses its polarity. One can see from this analysis that the voltage versus current characteristic of a resistor should be an odd function, v(-i) = -v(i). Real resistors are nonlinear, as shown in Fig. 1(b). One can inject a voltage and measure the current. If a third-harmonic-free periodic current waveform is injected, odd harmonics arise due to the nonlinearity of the resistance.
The objective is to generate the proposed waveforms,
called Type I and Type II, using digital circuits, and
measure their frequency spectrum to compare with the
theoretical results. As calculated by Santos and Barybin [4],
it is expected that the Type I waveform has 2.23% third
harmonic level, when slew rate is 10 V/s. For the Type II
waveform is expected third harmonic content equal to zero
in all cases evaluated.
This paper is divided in five sections, this introduction
is the first. Next, materials and methods are discussed.
Third, the generated waveforms are evaluated. Fourth, the
discussion, and finally the conclusions.

Fig. 3. (above) Type I waveform generated by FPGA.
(below) Frequency spectrum.
2. MATERIALS AND METHODS
To build the waveform generator, two different programmable devices are used: an FPGA and a microcontroller. The first generator was based on an FPGA. In Fig. 2, the block diagram of the proposed circuit is presented. The waveform generator is described in VHDL. For the implementation of this first generator, the UP2 Education Kit development board from Altera [5] was used. This board has a MAX 7000S device, model EPM7128S. The software used for the circuit description in VHDL was Quartus II 9.0 from Altera, and ModelSim (Mentor Graphics) was used to simulate the VHDL code. The harmonic analysis using the Fast Fourier Transform (FFT) was performed with the Agilent oscilloscope model DSO3062A. The VHDL code has the function of generating the digital words responsible for producing the desired waveform. These words are sent to a digital-to-analog converter (DAC) external to the FPGA, via SPI. The generator used a 12-bit DAC with a reference voltage of 5 volts, giving 2^12 = 4096 voltage levels, that is, a resolution of 5 V / 4096 ≈ 1.22 mV.
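The sketch below illustrates this arrangement in VHDL: a small sample ROM whose 12-bit words are framed into 16-bit SPI transfers for an external DAC. It is our illustration, not the authors' code; the sample values are placeholders (the real Type I/II step levels come from [4]), and the four leading configuration bits follow the MCP4921 data sheet as we understand it (DAC A, unbuffered, gain 1x, active) and should be checked before use.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity wave_spi is
    port (clk  : in  std_logic;   -- assumed SPI bit-rate clock
          rst  : in  std_logic;
          sclk : out std_logic;
          cs_n : out std_logic;
          sdi  : out std_logic);
  end entity;

  architecture rtl of wave_spi is
    type rom_t is array (0 to 7) of unsigned(11 downto 0);
    -- Placeholder samples only, for illustration.
    constant SAMPLES : rom_t := (x"800", x"C00", x"F00", x"C00",
                                 x"800", x"400", x"100", x"400");
    signal idx   : natural range 0 to 7  := 0;
    signal bitn  : natural range 0 to 15 := 15;
    signal frame : std_logic_vector(15 downto 0);
  begin
    frame <= "0011" & std_logic_vector(SAMPLES(idx));  -- config bits + 12-bit data
    sclk  <= not clk;  -- data changes on clk rising edge; DAC samples on sclk rising edge
    process (clk)
    begin
      if rising_edge(clk) then
        if rst = '1' then
          idx <= 0; bitn <= 15; cs_n <= '1';
        else
          cs_n <= '0';
          sdi  <= frame(bitn);                 -- shift MSB first
          if bitn = 0 then
            bitn <= 15;
            cs_n <= '1';                       -- latch the word in the DAC
            if idx = 7 then idx <= 0; else idx <= idx + 1; end if;
          else
            bitn <= bitn - 1;
          end if;
        end if;
      end if;
    end process;
  end architecture;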
The second generator is built with a microcontroller. The selected microcontroller has an integrated SPI communication circuit. The microcontroller code was written in the C language to generate the words, manipulate the SPI interface, and send them to the external DAC for conversion. For this second generator, the eZ430-F2013 tool from Texas Instruments [6] was used. This board uses the MSP430 microcontroller, model MSP430F2013. The C code was generated and compiled using the IAR Embedded Workbench Kickstart for MSP430, from IAR Systems.
"
"
"
"
+ + =
+ + =
+ + =
+ + =
5 5 3 3
3
4 4 2 2
2
3 3
1 1
4 4 2 2
0
16
5
4
2 2
4
3
8
3
2
V
c
V
c
a
V
c
V
c
a
V
c
V c a
V
c
V
c
a
86

Fig. 4. (above) Type II waveform generated by FPGA. (below) Frequency spectrum.
In addition to the development boards, a 12-bit DAC from Microchip, model MCP4921 [7], was also used. This converter has an integrated SPI communication interface. The block diagram is illustrated in Fig. 2: the SPI block receives the words and sends them to the shift register, which routes them to the DAC for conversion.
3. EVALUATION OF GENERATED WAVEFORMS
The Type I and Type II waveforms generated from the development boards were analyzed with an oscilloscope using the Fast Fourier Transform (FFT) for evaluation of the harmonics. The results are presented next.
3.1. Evaluation of the Type I and Type II waveforms generated by the FPGA
For the Type I waveform, the FFT analysis is based on the curve shown in Fig. 3. For the FFT analysis the horizontal resolution is 10 kHz per division (10 kHz/div) and the vertical resolution is 100 mVrms per division (100 mVrms/div). The analysis of the harmonics shows the absence of even harmonics, as expected, and the absence of the third harmonic. Other harmonics are present: the fifth, the seventh and, subtly, the ninth.
For the Type II waveform, the FFT harmonic analysis is shown in Fig. 4. This second waveform displays harmonic content almost identical to that of Fig. 3. The third harmonic is not present.

Fig. 5. (above) Type I waveform generated by the
microcontroller. (below) Frequency spectrum.
3.2. Evaluation of the Type I and Type II waveform
generated by the microcontroller
The Type I waveform generated by the microcontroller is shown in Fig. 5, together with the FFT analysis. One notes the great similarity to that of Fig. 3. The Type II waveform generated by the microcontroller is shown in Fig. 6, along with the FFT analysis. Here one sees a difference from the Type II waveform generated by the FPGA: the second, third and sixth harmonics are clearly present. The fourth harmonic is absent.
4. DISCUSSION
After carefully analyzing the generated waveforms, one observes that the Type II waveform generated by the microcontroller (Fig. 6) does not have symmetry between the positive and negative voltage peaks. The C code used defines exactly half of the reference voltage of the 12-bit DAC (7FF). This lack of symmetry is also observed in the Type I waveform of Fig. 5, but no significant changes in the harmonics are observed when Figs. 3 and 5 are compared.
In the waveforms generated by the FPGA, this symmetry between the positive and negative voltage peaks is observed. This symmetry ensures that the spectrum has only the fifth and seventh harmonics.
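The link between this symmetry and the absence of even harmonics is the standard half-wave-symmetry argument (a textbook step, not spelled out in the paper): if v(t + T/2) = -v(t), then

\[
a_n=\frac{2}{T}\int_{0}^{T}v(t)\cos(n\omega t)\,dt
   =\frac{2}{T}\int_{0}^{T/2}\bigl[v(t)+(-1)^{n}\,v(t+T/2)\bigr]\cos(n\omega t)\,dt,
\]

which vanishes for every even n, since \cos\bigl(n\omega(t+T/2)\bigr)=(-1)^{n}\cos(n\omega t). The same holds for the sine coefficients, so a waveform with this symmetry cannot contain even harmonics, regardless of its detailed shape.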
The Type II waveform was simulated taking into account the non-symmetry of the positive peak. Another simulation was performed with the symmetric Type II waveform. The results of the spectral analysis using the FFT are shown in Fig. 7. The software used was Matlab 7.
Fig. 6. (above) Type II waveform generated by the microcontroller. (below) Frequency spectrum.
The simulated symmetric Type II waveform presents the fifth and seventh harmonics. This result matches the measurement shown in Fig. 4. The simulated non-symmetric Type II waveform features even harmonics (second and fourth). The measurement in Fig. 6 subtly shows the third harmonic, which is not present in the simulated waveform of Fig. 7. This difference is possibly caused by noise in the positive peak of the waveform observed in Fig. 6. For the real Type II waveform, no variation in the frequency spectrum was observed when a 30° lag is included in the traditional waveform (Fig. 2). This delay was simulated in Matlab and no variation was observed in the frequency spectrum.
5. CONCLUSION
As expected from the analysis by Santos and Barybin, the real Type II waveform did not present third harmonic content. These measurements confirm the theoretical results obtained, so this waveform can be used to assess the non-linearity of passive components.
The proposal to generate a signal free of third harmonic was more successful on the FPGA. Symmetry defects were observed in the signal generated by the microcontroller used, compromising the final result.
The next step is to apply the generated waveforms to real resistors and evaluate their linearity using spectral analysis.

Fig. 7. Spectral analysis performed on Matlab 7 for non-
symmetric Type II waveform (above) and symmetric Type
II waveform (below).
6. REFERENCES
[1] Danbridge A/S, CLT 10 Component Linearity Test
Equipment Application Note. 2002. <http://
danbridge.dk.web13.123test.dk/Files/filelement30.pdf>
[2] D. Pederson, K. Mayaram, Analog Integrated Circuits for
Communication. New York: Springer, 2008.
[3] P. Corey, Methods for optimizing the waveform of stepped-
wave static inverters, AIEE Summer General Meeting, Jun.
1962.
[4] E. Santos, A. Barybin, Stepped-waveform synthesis for
reducing third harmonic content, IEEE Transactions on
Instrumentation and Measurement, vol. 54, no. 3, pp. 1296-
1302, Jun. 2005.
[5] Altera Corporation, University Program UP2 Education Kit
User Guide. Dec. 2004. <http://www.altera.com/
literature/univ/upds.pdf>
[6] Texas Instruments, eZ430-F2013 Development Tool User's
Guide. 2006. <http://focus.ti.com/lit/ug/slau176b/
slau176b.pdf>
[7] Microchip Technology Inc., 12-Bit DAC with SPI Interface.
2004. <http://ww1.microchip.com/downloads/en/
DeviceDoc/21897a.pdf>
AUTHORS INDEX

Araújo, A. H. C. B. S. ..................................................................................................75
Barreto, R. S. ..............................................................................................................29
Belmonte, J. ................................................................................................................7
Boemo, E. ..............................................................................................................19
Borensztejn, P. ............................................................................................13, 57
Caraciolo, M. P. ..................................................................................................69
Caruso, D. M. ................................................................................................................1
Carvalho, G. R. ..................................................................................................75
Cayssials, R. ............................................................................................................. 43
Cordeiro, F. R. ................................................................................................. 69
Corti, R. ............................................................................................................... 7
Crepaldo, D. A. ................................................................................................. 25
D'Agostino, E. ................................................................................................... 7
De Farias, T. M. T. ..................................................................................................... 35
De Lima, J. A. G. ................................................................................................. 35
De Lima, M. E. ........................................................................................................... 75
De Maria, E. A. A. ................................................................................................. 39
Del Rios, J. ............................................................................................................. 63
Dias, W. R. A. ............................................................................................................. 29
Ferreira, L. P. ............................................................................................................. 69
Ferro, E. ............................................................................................................. 43
Freitas, D. R. R. ................................................................................................. 85
Gazineu, G. M. ................................................................................................. 75
Giandomenico, E. .................................................................................................. 7
Luppe, M. ............................................................................................................ 47
Maidana, C. E............................................................................................................. 39
Martin, R. L. ............................................................................................................ 25
Martínez, R. .............................................................................................................. 7
Mesquita Júnior, A. C. .....................................................................................81
Moreno, E. D. ............................................................................................................ 29
Mosquera, J. ............................................................................................................ 13
Oliveira, D. L. ............................................................................................................ 63
Ortega-Ruiz, J. ........................................................................................................... 19
Pedre, S. .......................................................................................................13, 57
Prado, R. G. ............................................................................................................ 75
Romano, L. ............................................................................................................ 63
Sacco, M. ............................................................................................................ 13
Schiavon, M. I. .......................................................................................................... 25
Santos, E. J. P. ............................................................................................................ 85
Silva-Filho, A. G. ......................................................................................... 69, 75
Soares, D. ............................................................................................................ 53
Stoliar, A. .......................................................................................................13, 57
Szklanny, F. I. ............................................................................................................ 39
Torquato, L. ............................................................................................................ 53
Tropea, S. E. ...............................................................................................................1
Urriza, J. ............................................................................................................ 43
Viana, P. ............................................................................................................ 53
ISBN:
