
CONTENTS

Part 1. Usage mining

Chapter 1. Web statistics
• Online voting and analysis of the results
Chapter 2. Nearest-neighbor classification
• Graphical display of the decision regions
Chapter 3. Decision-tree classification
• Personalized news based on user profiles
Chapter 4. Association rule induction
• Online shopping and market basket analysis

Part 2. Content mining

Chapter 5. Neural networks
• Predicting the traffic on a site
• Spam detection
Chapter 6. Bayesian classification
• Spam detection
• Relevance of searches in documents
Chapter 7. k-means clustering
• Visualizing the k-means clustering process
• Document clustering

Part 3. Structure mining

Chapter 8. Search engines
• Indexing and searching within a site
Chapter 9. Relevance of web pages
• The PageRank algorithm
Florin Leon (2008). Explorarea datelor web. Aplicatii, Tehnopress, Iasi, ISBN 978-973-702-530-2
http://florinleon.byethost24.com


Part 1

USAGE MINING



Chapter 1. Web statistics
• Online voting and analysis of the results

Chapter 2. Nearest-neighbor classification
• Graphical display of the decision regions

Chapter 3. Decision-tree classification
• Personalized news based on user profiles

Chapter 4. Association rule induction
• Online shopping and market basket analysis
Chapter 1


WEB STATISTICS


In this chapter we describe a way of preprocessing a database of votes
cast online, by eliminating the extreme values before the statistical
analysis is applied. To test the program, a database of votes is generated
first, as a starting point for the main program. In a real execution
environment this step would of course be omitted. The main program loads
the recorded votes, allows the actual online voting, and displays the mean
of the votes computed in several ways.

Application 1.1. Generating a database of votes based on a predefined
model.


using System;
using System.IO;


namespace GenDB
{
    class Program
    {
        static void Main(string[] args)
        {
            StreamWriter sw = new StreamWriter("database.txt");
            Random r = new Random();

            for (int i = 0; i < 1000; i++)
            {
                int varsta = r.Next(4); // age category
                int sex = r.Next(2);
                int tara = r.Next(3);   // country
                int vot = 1;            // vote

                // each profile has a base mark, perturbed by a noise of -1, 0 or +1
                switch (tara)
                {
                    case 0:
                        if (sex == 0)
                            vot = 8 + r.Next(-1, 2);
                        else
                            vot = 5 + r.Next(-1, 2);
                        break;
                    case 1:
                        if (sex == 0)
                            vot = 10 + r.Next(-1, 2);
                        else
                            vot = 5 + r.Next(-1, 2);
                        break;
                    case 2:
                        vot = 7 + r.Next(-1, 2);
                        break;
                }

                // about 5% of the votes are outliers, around the mark 2
                if (r.NextDouble() < 0.05)
                    vot = 2 + r.Next(-1, 2);

                // keep the vote between 1 and 10
                if (vot > 10) vot = 10;
                if (vot < 1) vot = 1;

                sw.WriteLine("" + varsta + "\t" + sex + "\t" + tara + "\t" + vot);
            }
            sw.Close();

            Console.WriteLine("Fisierul a fost generat.");
            Console.ReadKey();
        }
    }
}



Figure 1.1. Confirmation message for the generation of the test database

The contents of the database are shown in figure 1.2.


Figure 1.2. Excerpt from the vote database


Application 1.2. The main program, in which users can vote. A vote can
be cast as an integer mark from 1 to 10 or, more simply, as a rating
label, which is converted automatically into a mark. The results (the mean
of the votes) are displayed in three variants: the plain arithmetic mean,
the arithmetic mean of the votes that remain after eliminating 10% of the
extreme values at each end of the sorted list, and the arithmetic mean of
the votes lying within one standard deviation (1σ) of the mean, treating
the distribution of the votes as a normal distribution.
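
The three variants can be sketched independently of the web pages. The block below is only an illustration: the class `VoteAverages`, its method names and the sample votes are ours, and the trimmed mean drops the same number of votes at each end, which may differ by one element from the index arithmetic of the listing in section 1.2.3.

```csharp
using System;
using System.Linq;

static class VoteAverages
{
    // Plain arithmetic mean.
    public static double Mean(int[] votes)
    {
        return votes.Average();
    }

    // Mean after sorting and dropping the lowest and the highest
    // `fraction` of the votes (0.1 drops 10% at each end).
    public static double TrimmedMean(int[] votes, double fraction)
    {
        int[] sorted = votes.OrderBy(v => v).ToArray();
        int from = (int)(fraction * sorted.Length);
        int to = sorted.Length - from; // exclusive upper index
        if (to <= from) return double.NaN; // nothing left to average
        return sorted.Skip(from).Take(to - from).Average();
    }

    // Mean of the votes lying within `width` sample standard
    // deviations of the plain mean.
    public static double SigmaMean(int[] votes, double width)
    {
        double mean = Mean(votes);
        if (votes.Length < 2) return mean; // no spread to measure
        double variance = votes.Sum(v => (v - mean) * (v - mean)) / (votes.Length - 1);
        double sigma = Math.Sqrt(variance);
        int[] kept = votes.Where(v => Math.Abs(v - mean) <= width * sigma).ToArray();
        return kept.Length == 0 ? double.NaN : kept.Average();
    }

    static void Main()
    {
        int[] votes = { 1, 7, 7, 8, 8, 8, 8, 9, 9, 10 }; // hypothetical sample
        Console.WriteLine("{0:F2} {1:F2} {2:F2}",
            Mean(votes), TrimmedMean(votes, 0.1), SigmaMean(votes, 1));
    }
}
```

With this sample, the single vote of 1 pulls the plain mean down, while both filtered variants ignore it.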


1.2.1. Default.aspx.cs: records a vote cast online.

The design of the graphical interface is shown in figure 1.3. Next to
each control, the figure indicates its type and the name used later in the
source code.

Figure 1.3. Design of the graphical interface for recording a vote


using System;
using System.IO;


public partial class _Default : System.Web.UI.Page
{
    protected void ButtonVoteaza_Click(object sender, EventArgs e)
    {
        // the vote button was pressed

        bool ok = true;
        int age = RadioButtonListCategVarsta.SelectedIndex;
        int sex = RadioButtonListSex.SelectedIndex;
        int tara = DropDownListTara.SelectedIndex;

        int nota = -1;
        if (RadioButtonListTipVot.SelectedIndex == 0) // simple vote (rating label)
        {
            switch (DropDownListVotSimplu.SelectedIndex)
            {
                case 0: nota = 0; // very bad
                    break;
                case 1: nota = 2; // bad
                    break;
                case 2: nota = 5; // average
                    break;
                case 3: nota = 8; // good
                    break;
                case 4: nota = 10; // very good
                    break;
            }
        }
        else // complex vote (numeric mark)
        {
            try
            {
                nota = Convert.ToInt32(TextBoxVotComplex.Text);
            }
            catch // the text entered is not a valid integer
            {
                ok = false;
            }

            // the code accepts marks between 0 and 10, so 0 is also valid here
            if (nota < 0 || nota > 10)
                ok = false;
        }

        if (ok)
        {
            // append the vote to the log: age, sex, country, mark
            StreamWriter sw = new StreamWriter(Server.MapPath(".") + "\\records.log", true);
            sw.WriteLine("" + age + "\t" + sex + "\t" + tara + "\t" + nota);
            sw.Close();
            Response.Redirect("Recorded.htm");
        }
    }
}


1.2.2. Recorded.htm: the HTML page that confirms the recording of a
vote.


<html>
<head>
<title>Vot inregistrat</title>
</head>
<body>
<p>Votul a fost inregistrat</p>
<p><a href="Default.aspx">Inapoi</a></p>
<p><a href="Statistics.aspx">Vezi statistici</a></p>
</body>
</html>


1.2.3. Statistics.aspx.cs: computes and displays the vote statistics.
The user can select the subset of data to be analyzed. Since the user
profile is given by country, sex and age, the analysis can be carried out
over all these values or only over the desired segments: a particular
country, a particular sex or a particular age category.
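
The selection logic can be sketched without the web controls. In the block below the `Vote` class, the `VoteFilter.Select` method and the use of `null` as the "all values" marker are our assumptions for illustration; the listing that follows uses the last item of each drop-down list for the same purpose.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One record of the vote log: age category, sex, country, mark.
class Vote
{
    public int Age, Sex, Country, Mark;
    public Vote(int age, int sex, int country, int mark)
    {
        Age = age; Sex = sex; Country = country; Mark = mark;
    }
}

static class VoteFilter
{
    // A null criterion means "all values" for that attribute.
    public static List<Vote> Select(IEnumerable<Vote> votes, int? age, int? sex, int? country)
    {
        return votes.Where(v =>
            (age == null || v.Age == age) &&
            (sex == null || v.Sex == sex) &&
            (country == null || v.Country == country)).ToList();
    }

    static void Main()
    {
        // hypothetical records
        var votes = new List<Vote>
        {
            new Vote(0, 0, 0, 8), new Vote(1, 1, 0, 5),
            new Vote(2, 0, 1, 10), new Vote(3, 1, 2, 7)
        };
        List<Vote> sel = Select(votes, null, 0, null); // sex 0, any age and country
        Console.WriteLine("{0} votes, mean {1:F2}", sel.Count, sel.Average(v => v.Mark));
    }
}
```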


Figure 1.4. Design of the graphical interface for displaying the statistics


using System;
using System.IO;


public partial class Statistics : System.Web.UI.Page
{
    int[,] date; int[,] date_sel;
    int nr_date, nr_date_sel;


    protected void Page_Load(object sender, EventArgs e)
    {
        const int MaxData = 10000;
        date = new int[MaxData, 4];     // all the recorded votes
        date_sel = new int[MaxData, 4]; // the votes in the current selection

        StreamReader sr = new StreamReader(Server.MapPath(".") + "\\records.log");
        int i = 0;

        // each line holds: age, sex, country, mark; stop when the array is full
        while (sr.Peek() != -1 && i < MaxData)
        {
            string line = sr.ReadLine();
            string[] toks = line.Split();

            for (int j = 0; j < 4; j++)
                date[i, j] = Convert.ToInt32(toks[j]);
            i++;
        }
        nr_date = i;
        sr.Close();
    }


    protected void ButtonAnaliza_Click(object sender, EventArgs e)
    {
        int tara = DropDownListTara.SelectedIndex;
        int sex = DropDownListSex.SelectedIndex;
        int varsta = DropDownListVarsta.SelectedIndex;

        // the last item of each list matches every value of that attribute
        bool btara = (tara == DropDownListTara.Items.Count - 1);
        bool bvarsta = (varsta == DropDownListVarsta.Items.Count - 1);
        bool bsex = (sex == DropDownListSex.Items.Count - 1);

        int nr = 0, suma = 0;
        nr_date_sel = 0;

        for (int i = 0; i < nr_date; i++)
        {
            if ((btara || date[i, 2] == tara) &&
                (bsex || date[i, 1] == sex) &&
                (bvarsta || date[i, 0] == varsta))
            {
                nr++;
                suma += date[i, 3];

                // copy the record into the selection
                for (int j = 0; j < 4; j++)
                    date_sel[nr_date_sel, j] = date[i, j];
                nr_date_sel++;
            }
        }

        if (nr == 0) // no vote matches the selection
        {
            TextBoxMediaNormala.Text = "Imposibil";
            return;
        }

        double media = (double)suma / (double)nr;

        // plain mean and number of votes, then the two filtered means
        TextBoxMediaNormala.Text = string.Format("{0:F2} - {1}", media, nr);
        TextBoxMediaI1.Text = string.Format("{0:F2}",
            MediaFaraExtremeProcent10());
        TextBoxMediaI2.Text = string.Format("{0:F2}",
            MediaFaraExtremeSigma(media));
    }


    private double MediaFaraExtremeProcent10()
    {
        // the selected votes left after discarding 10% of the values at each end
        int[,] date_p5 = new int[nr_date_sel, 4];

        int nr_date_p5 = 0;

        Shell(date_sel, nr_date_sel); // sort the selection by mark, ascending

        int sum = 0;

        for (int i = (int)(0.1 * nr_date_sel); i < (int)(0.9 * nr_date_sel); i++)
        {
            for (int j = 0; j < 4; j++)
                date_p5[nr_date_p5, j] = date_sel[i, j];
            sum += date_sel[i, 3]; // mark of the i-th sorted vote
            nr_date_p5++;
        }

        if (nr_date_p5 == 0)
            return -1;
        else
            return (double)sum / (double)nr_date_p5;
    }


    private void Shell(int[,] v, int len)
    {
        // Shell sort of the records, ascending by mark (column 3)

        int dist, i, j, aux;

        for (dist = len / 2; dist > 0; dist /= 2)
            for (i = dist; i < len; i++)
                for (j = i - dist; j >= 0 && v[j, 3] > v[j + dist, 3]; j -= dist)
                {
                    // swap the whole records j and j + dist
                    for (int t = 0; t < 4; t++)
                    {
                        aux = v[j, t];
                        v[j, t] = v[j + dist, t];
                        v[j + dist, t] = aux;
                    }
                }
    }


    private double MediaFaraExtremeSigma(double media)
    {
        // a single vote has no spread; its mean is the vote itself
        if (nr_date_sel < 2)
            return media;

        // sample standard deviation of the selected votes
        double sum = 0;

        for (int i = 0; i < nr_date_sel; i++)
        {
            sum += (date_sel[i, 3] - media) * (date_sel[i, 3] - media);
        }
        double sigma = Math.Sqrt(sum / (nr_date_sel - 1));

        int suma_nr = 0;
        int nr_date_sigma = 0;

        double dsigma = 1; // half-width of the interval, in standard deviations

        // keep only the votes in [media - dsigma * sigma, media + dsigma * sigma]
        for (int i = 0; i < nr_date_sel; i++)
        {
            if (date_sel[i, 3] >= media - dsigma * sigma &&
                date_sel[i, 3] <= media + dsigma * sigma)
            {
                nr_date_sigma++;
                suma_nr += date_sel[i, 3];
            }
        }

        if (nr_date_sigma == 0)
            return -1;
        else
            return (double)suma_nr / (double)nr_date_sigma;
    }
}


The whole application in execution can be followed in the next figures:
entering a vote (figure 1.5), the confirmation that the vote was recorded
together with the choice of the next action, either entering a new vote or
displaying the statistics (figure 1.6), and the display of the mean of the
votes computed in three different variants, one direct and two based on
preprocessing the data and eliminating the extreme values (figure 1.7).


Figure 1.5. Entering a vote



Figure 1.6. Confirmation of the recorded vote and choice of the next action



Figure 1.7. Display of the statistics

Chapter 2


NEAREST-NEIGHBOR CLASSIFICATION


In this chapter we present a series of programs that are useful when
working with nearest-neighbor classifiers.

Application 2.1. Computing the distances from a new instance to the
instances in the training set. The Euclidean distance and attribute
weighting are used.
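
In symbols, the two distances computed by the listing can be written as below. The notation is ours rather than the book's: x is the new instance, y a training instance, n the number of attributes and w_j the weight of attribute j.

```latex
d(x, y) = \sqrt{\sum_{j=1}^{n} (x_j - y_j)^2}
\qquad
d_w(x, y) = \sqrt{\sum_{j=1}^{n} w_j \, (x_j - y_j)^2}
```

Setting every w_j to 1 recovers the plain Euclidean distance, and an attribute with w_j = 0 is ignored altogether.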


using System;
using System.IO;


namespace Distante
{
    class Program
    {
        static void Main(string[] args)
        {
            StreamReader sr = new StreamReader("date.txt");
            const int noAttrib = 5; // four attributes followed by the class
            const int noInst = 14;

            double[,] date = new double[noInst, noAttrib];
            for (int i = 0; i < noInst; i++)
            {
                string line = sr.ReadLine();
                string[] toks = line.Split(" \t".ToCharArray(),
                    StringSplitOptions.RemoveEmptyEntries);
                for (int j = 0; j < noAttrib; j++)
                    date[i, j] = Convert.ToDouble(toks[j]);
            }

            sr.Close();

            // the new instance
            double[] testInst = new double[] { 0, 0.52, 0.16, 0 };
            //double[] testInst = new double[] { 0, 0.33, 0.84, 1 };
            //double[] testInst = new double[] { 0, 0.32, 0.86, 1 };

            StreamWriter sw = new StreamWriter("distante.txt");

            sw.WriteLine("Distanta (d)\tPonderea instantei (1/d^2)\tCu ponderarea atributelor");

            double voturiPentruDa = 0; // votes for class 1 (yes)
            double voturiPentruNu = 0; // votes for class 0 (no)

            int clasaImpusa = -1; // initially no class is imposed

            double[] weight2 = { 0.736, 0, 0.181, 0.083 }; // attribute weights

            for (int i = 0; i < noInst; i++)
            {
                double dist = 0;
                for (int j = 0; j < noAttrib - 1; j++)
                    dist += (date[i, j] - testInst[j]) * (date[i, j] - testInst[j]);

                dist = Math.Sqrt(dist); // Euclidean distance
                sw.Write("" + dist + "\t");

                if (dist < 1e-12) // dist == 0
                {
                    sw.Write("dist = 0\t");
                    // class of the training instance identical to the new instance
                    clasaImpusa = (int)date[i, noAttrib - 1];
                }
                else
                {
                    // the vote of the instance is weighted by 1/d^2
                    sw.Write("" + (1 / (dist * dist)) + "\t");
                    if (date[i, noAttrib - 1] == 0)
                        voturiPentruNu += 1 / (dist * dist);
                    else // (date[i, noAttrib - 1] == 1)
                        voturiPentruDa += 1 / (dist * dist);
                }

                dist = 0;
                for (int j = 0; j < noAttrib - 1; j++)
                    dist += (date[i, j] - testInst[j]) * (date[i, j] - testInst[j]) * weight2[j];

                dist = Math.Sqrt(dist); // Euclidean distance with weighted attributes
                sw.WriteLine("" + dist);
            }

            if (clasaImpusa == -1)
            {
                sw.WriteLine("\r\nVoturi pentru da/1 : " + voturiPentruDa +
                    "\r\nVoturi pentru nu/0 : " + voturiPentruNu);
            }
            sw.Close();
        }
    }
}

Figure 2.1 shows the data in the training set. Figure 2.2 lists the
distances to each instance of this set, the weights of the influence of
these instances, inversely proportional to the squared distances, and the
distances obtained when the attributes have different weights. Since each
instance in the training set belongs to a class, the output ends with the
sum of the weighted votes for the membership of the new instance in each
of the two classes.
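
The weighted vote described above can be isolated in a small function. This is a minimal sketch under our own assumptions: the class `WeightedVote`, its method names and the sample points are hypothetical, and every training instance takes part in the vote, as in Application 2.1.

```csharp
using System;

static class WeightedVote
{
    // Euclidean distance between two attribute vectors of equal length.
    public static double Distance(double[] a, double[] b)
    {
        double sum = 0;
        for (int j = 0; j < a.Length; j++)
            sum += (a[j] - b[j]) * (a[j] - b[j]);
        return Math.Sqrt(sum);
    }

    // Returns the class with the largest sum of 1/d^2 weights.
    // A training instance identical to the query (d = 0) imposes its class.
    public static int Classify(double[][] train, int[] classes, int noClasses, double[] query)
    {
        double[] votes = new double[noClasses];
        for (int i = 0; i < train.Length; i++)
        {
            double d = Distance(train[i], query);
            if (d < 1e-12)
                return classes[i];
            votes[classes[i]] += 1 / (d * d);
        }

        int best = 0;
        for (int c = 1; c < noClasses; c++)
            if (votes[c] > votes[best])
                best = c;
        return best;
    }

    static void Main()
    {
        // hypothetical training set: one point of class 0, two of class 1
        double[][] train = { new double[] { 0, 0 }, new double[] { 1, 0 }, new double[] { 0, 1 } };
        int[] classes = { 0, 1, 1 };
        Console.WriteLine(Classify(train, classes, 2, new double[] { 0.2, 0.2 }));
    }
}
```

In the sample call, the single nearby point of class 0 outweighs the two more distant points of class 1, which is the effect the inverse-square weighting is meant to produce.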


Figure 2.1. The instances in the training set


Figure 2.2. Computing the distances and classifying the new instance


Application 2.2. To avoid a disproportionately large influence of the
attributes with a wider domain, the attribute values are usually
normalized between 0 and 1. The following program normalizes a value
between the minimum and the maximum of its domain.


Figure 2.3. Design of the interface of the normalization application


using System;
using System.Windows.Forms;


namespace Normalizare
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void buttonNorm_Click(object sender, EventArgs e)
        {
            double min = Convert.ToDouble(textBoxMin.Text);
            double max = Convert.ToDouble(textBoxMax.Text);
            double val = Convert.ToDouble(textBoxVal.Text);

            // min-max normalization; max must differ from min
            double valNorm = (val - min) / (max - min);

            textBoxNorm.Text = valNorm.ToString();
        }
    }
}
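
The same scaling is usually applied to a whole attribute column before the distances are computed. The sketch below is an illustration under our own assumptions: the names `MinMax`, `Normalize` and `NormalizeColumn` are hypothetical, and mapping a constant column to 0 is an arbitrary choice made to avoid a division by zero.

```csharp
using System;
using System.Linq;

static class MinMax
{
    // Maps value from [min, max] to [0, 1].
    public static double Normalize(double value, double min, double max)
    {
        if (max == min)
            return 0; // degenerate domain: assumption, map everything to 0
        return (value - min) / (max - min);
    }

    // Normalizes a column using its own minimum and maximum.
    public static double[] NormalizeColumn(double[] column)
    {
        double min = column.Min(), max = column.Max();
        return column.Select(v => Normalize(v, min, max)).ToArray();
    }

    static void Main()
    {
        double[] ages = { 12, 30, 72 }; // hypothetical values
        Console.WriteLine(string.Join(" ", NormalizeColumn(ages)));
    }
}
```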

Figure 2.3. The normalization application in execution


Application 2.3. A DLL library that implements the k-Nearest Neighbor
classifier.


2.3.1. knn.cs: the source code of the DLL library.


using System;
using System.Collections.Generic;


namespace KNNClassifier
{
    /// <summary>
    /// The k-nearest neighbors classifier
    /// </summary>
    public class KNN
    {
        public readonly List<Instance> Instances;
        public int NoClasses;

        public KNN(List<Instance> trainingSet, int numberOfClasses)
        {
            this.Instances = trainingSet;
            this.NoClasses = numberOfClasses;
        }

        public void AddInstance(Instance instance)
        {
            Instances.Add(instance);
        }

        public void Clear()
        {
            Instances.Clear();
        }


        /// <summary>
        /// The classification itself
        /// </summary>
        /// <param name="inst">The new instance to be classified</param>
        /// <param name="k">The number of neighbors</param>
        /// <returns>The class (>= 0); an exception is thrown on error</returns>
        public int Classify(Instance inst, int k)
        {
            if (Instances == null)
                throw new Exception("Setul de antrenare nu este initializat");
            if (Instances.Count == 0)
                throw new Exception("Setul de antrenare nu contine nicio instanta");

            if (k > Instances.Count)
                throw new Exception("Sunt mai putine instante in setul de antrenare decat k");

            // column 0: squared distance, column 1: index of the training instance
            double[,] distances = new double[Instances.Count, 2];

            for (int i = 0; i < Instances.Count; i++)
            {
                Instance existingInstance = Instances[i];

                double dist = 0;
                for (int j = 0; j < inst.AttributeValues.Length; j++)
                {
                    dist += (existingInstance.AttributeValues[j] - inst.AttributeValues[j]) *
                        (existingInstance.AttributeValues[j] - inst.AttributeValues[j]);
                }

                distances[i, 0] = dist; // squared distance; the ordering is the same as for Math.Sqrt(dist)
                distances[i, 1] = i;
            }

            ShellSort(distances, Instances.Count);

            if (k == 1)
            {
                Instance nn = Instances[(int)distances[0, 1]];
                return nn.Class;
            }
            else
            {
                double[] votes = new double[NoClasses];

                for (int m = 0; m < k; m++)
                {
                    Instance nn = Instances[(int)distances[m, 1]];

                    // an identical training instance imposes its class
                    if (distances[m, 0] < 1e-12)
                        return nn.Class;

                    // distances[m, 0] already holds d^2, so the weight is 1/d^2
                    votes[nn.Class] += 1 / distances[m, 0];
                    // an alternative is: votes[nn.Class] += Math.Exp(-distances[m, 0]);
                }

                // the class with the most weighted votes wins
                int maxIndex = 0; double max = votes[0];

                for (int m = 1; m < NoClasses; m++)
                {
                    if (votes[m] > max)
                    {
                        max = votes[m];
                        maxIndex = m;
                    }
                }
                return maxIndex;
            }
        }


        /// <summary>
        /// Sorts the distances in ascending order
        /// </summary>
        /// <param name="v">Rows of (distance, instance index) pairs</param>
        /// <param name="len">Number of rows to sort</param>
        private void ShellSort(double[,] v, int len)
        {
            int dist, i, j;
            double aux;

            for (dist = len / 2; dist > 0; dist /= 2)
                for (i = dist; i < len; i++)
                    for (j = i - dist; j >= 0 && v[j, 0] > v[j + dist, 0]; j -= dist)
                    {
                        // swap both columns of rows j and j + dist
                        aux = v[j, 0]; v[j, 0] = v[j + dist, 0]; v[j + dist, 0] = aux;
                        aux = v[j, 1]; v[j, 1] = v[j + dist, 1]; v[j + dist, 1] = aux;
                    }
        }
    }
}

2.3.2. Instance.cs: the class that represents an instance.


namespace KNNClassifier
{
    public class Instance
    {
        public double[] AttributeValues;
        public int Class;


        /// <summary>
        /// The constructor
        /// </summary>
        /// <param name="values">The attribute values (normalized between 0 and 1)</param>
        /// <param name="instanceClass">The class to which the instance belongs</param>
        public Instance(double[] values, int instanceClass)
        {
            this.AttributeValues = values;
            this.Class = instanceClass;
        }
    }
}


Application 2.4. A program that uses the knn.dll library to classify
instances entered through a graphical interface. The user places points
of two colors in a drawing region. Red points (lighter gray in the figures
below) are placed with a left click and blue points (darker gray) with a
right click. The program then computes the class of every pixel in the
drawing area. The result shows the decision regions of the classifier.
The number of neighbors is set in the program, in the call of the
classification function, for example inst.Class = knn.Classify(inst, 3);
the listing below uses k = 1. Pressing the Draw button draws the decision
regions, and pressing the Clear button deletes the current training
instances (the points).
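
The idea of classifying every pixel can be tried without a form, on a coarse character grid. The sketch below is ours and does not use knn.dll: the class `DecisionRegions`, the one-nearest-neighbor rule written inline and the two sample points are assumptions chosen to keep the example self-contained.

```csharp
using System;
using System.Text;

static class DecisionRegions
{
    // Class of the training point nearest to (x, y); each point is a row {x, y}.
    public static int NearestClass(double[][] points, int[] classes, double x, double y)
    {
        int best = 0;
        double bestDist = double.MaxValue;
        for (int i = 0; i < points.Length; i++)
        {
            double dx = points[i][0] - x, dy = points[i][1] - y;
            double d = dx * dx + dy * dy; // the squared distance is enough for ranking
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return classes[best];
    }

    // One string per grid row: 'R' for class 0 (red), 'B' for class 1 (blue).
    public static string[] Render(double[][] points, int[] classes, int width, int height)
    {
        string[] rows = new string[height];
        for (int py = 0; py < height; py++)
        {
            var sb = new StringBuilder();
            for (int px = 0; px < width; px++)
            {
                // grid cell mapped to normalized coordinates, as the form does for pixels
                double x = px / (double)width, y = py / (double)height;
                sb.Append(NearestClass(points, classes, x, y) == 0 ? 'R' : 'B');
            }
            rows[py] = sb.ToString();
        }
        return rows;
    }

    static void Main()
    {
        // hypothetical training points: one red on the left, one blue on the right
        double[][] points = { new double[] { 0.2, 0.5 }, new double[] { 0.9, 0.5 } };
        int[] classes = { 0, 1 };
        foreach (string row in Render(points, classes, 40, 10))
            Console.WriteLine(row);
    }
}
```

With k = 1 and two points, the boundary between the regions is the perpendicular bisector of the segment joining them.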



Figure 2.4. Design of the interface of the program that determines the decision regions


using System;
using System.Drawing;
using System.Collections.Generic;
using System.Windows.Forms;

using KNNClassifier;


namespace RegiuniNN
{
    public class Form1 : System.Windows.Forms.Form
    {
        private KNN knn;

        public Form1()
        {
            InitializeComponent();

            // empty training set, two classes
            knn = new KNN(new System.Collections.Generic.List<Instance>(), 2);
        }


        private void pictureBox_MouseUp(object sender,
            System.Windows.Forms.MouseEventArgs e)
        {
            int cl = -1;

            if (e.Button == MouseButtons.Left)
                cl = 0;
            else if (e.Button == MouseButtons.Right)
                cl = 1;

            if (cl != -1) // either the left or the right button was pressed
            {
                // the coordinates are normalized between 0 and 1
                double[] val = new double[2];
                val[0] = e.X / (double)pictureBox.Width;
                val[1] = e.Y / (double)pictureBox.Height;
                Instance inst = new Instance(val, cl);
                knn.AddInstance(inst);
                pictureBox.Refresh();
            }
        }


        private void pictureBox_Paint(object sender, System.Windows.Forms.PaintEventArgs e)
        {
            if (knn.Instances == null)
                return;

            DrawDatabaseInstances(e.Graphics);
        }


        private void buttonClear_Click(object sender, System.EventArgs e)
        {
            knn.Clear();
            pictureBox.Refresh();
        }


        private void DrawDatabaseInstances(Graphics g)
        {
            for (int i = 0; i < knn.Instances.Count; i++)
            {
                Instance inst = knn.Instances[i];
                int x = (int)(inst.AttributeValues[0] * pictureBox.Width);
                int y = (int)(inst.AttributeValues[1] * pictureBox.Height);
                int cl = inst.Class;

                // an outlined disc of 8 pixels in the color of the class
                if (cl == 0)
                {
                    g.FillEllipse(Brushes.Red, x - 4, y - 4, 8, 8);
                    g.DrawEllipse(Pens.Pink, x - 4, y - 4, 8, 8);
                }
                else // if (cl == 1)
                {
                    g.FillEllipse(Brushes.Blue, x - 4, y - 4, 8, 8);
                    g.DrawEllipse(Pens.Green, x - 4, y - 4, 8, 8);
                }
            }
        }


        private void buttonDraw_Click(object sender, System.EventArgs e)
        {
            // "using" releases the Graphics object when drawing ends
            using (Graphics g = pictureBox.CreateGraphics())
            {
                // classify every pixel of the drawing area
                for (int x = 0; x < pictureBox.Width; x++)
                    for (int y = 0; y < pictureBox.Height; y++)
                    {
                        double[] val = new double[2];
                        val[0] = x / (double)pictureBox.Width;
                        val[1] = y / (double)pictureBox.Height;
                        Instance inst = new Instance(val, -1);

                        try
                        {
                            int k = 1; // can be > 1
                            inst.Class = knn.Classify(inst, k);
                        }
                        catch (Exception ex)
                        {
                            MessageBox.Show(ex.Message);
                            return;
                        }

                        if (inst.Class == 0)
                            g.FillRectangle(Brushes.Red, x, y, 1, 1);
                        else if (inst.Class == 1)
                            g.FillRectangle(Brushes.Blue, x, y, 1, 1);
                    }

                // redraw the training points on top of the regions
                DrawDatabaseInstances(g);
            }
        }
    }
}

The program in execution can be followed in the next figures: the
selected points, that is the training instances, in figure 2.5, and the
decision regions of the classifier in figure 2.6.


Figure 2.5. The training instances


Figure 2.6. The decision regions of the classifier

Chapter 3


DECISION-TREE CLASSIFICATION


In this chapter we present a series of applications based on another
classification method, namely decision trees. The goal of this suite of
applications is a site on which users register and create a personal
profile; when they log in, the site provides them with news from
personalized categories, depending on the profile (explicit attributes)
and on the moment of the login (implicit attribute). Because these
algorithms are rather difficult, we decided to use the Weka
implementation, available free of charge at
http://www.cs.waikato.ac.nz/ml/weka, which contains a suite of open-source
data mining algorithms, described in the volume: I. H. Witten, E. Frank,
Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, San Francisco, 2000. Weka uses a
specific format, ARFF (Attribute-Relation File Format), to describe the
training data sets.


Application 3.1. Generating a database of user profiles, used to test
the algorithms that build decision trees. The training data set is saved
in a file in the ARFF format. Two files are created: one that also
contains real-valued attributes, for the C4.5 algorithm, and one in which
the real values are discretized, for the simpler ID3 algorithm. The user
can decide whether the generated data contain noise (erroneous classes)
and in what proportion of the training set.
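
Discretization here means splitting the domain of a real attribute into intervals of equal width and replacing each value by the label of its interval. The block below is a minimal sketch of that step: the class `EqualWidth` and the method `Bin` are our names, while the labels and the domains 12 to 72 and 0 to 24 are the ones used by the generator that follows.

```csharp
using System;

static class EqualWidth
{
    // Index (0 .. bins - 1) of the equal-width interval of [min, max) that
    // contains value; values outside the domain fall into the first or last bin.
    public static int Bin(double value, double min, double max, int bins)
    {
        if (value <= min) return 0;
        if (value >= max) return bins - 1;
        return (int)(bins * (value - min) / (max - min));
    }

    static void Main()
    {
        string[] ageLabels = { "VeryLow", "Low", "Medium", "High", "VeryHigh" };
        string[] timeLabels = { "Night", "Morning", "Afternoon", "Evening" };

        // an age of 30 in the domain 12..72 and a login hour of 20 in 0..24
        Console.WriteLine(ageLabels[Bin(30, 12, 72, ageLabels.Length)]);
        Console.WriteLine(timeLabels[Bin(20, 0, 24, timeLabels.Length)]);
    }
}
```

Each age interval is 12 years wide and each time interval 6 hours wide, so the two sample values fall into the second age interval and the last time interval.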


using System;
using System.IO;


namespace GenerateDatabase
{
    class Program
    {
        static void Main(string[] args)
        {
            StreamWriter sw = new StreamWriter("profiles.arff");
            StreamWriter swDiscr = new StreamWriter("profilesDiscretized.arff");
            Random r = new Random();

            /*
            @RELATION profiles

            @ATTRIBUTE age REAL
            @ATTRIBUTE domain {Tehnic, Social}
            @ATTRIBUTE time REAL
            @ATTRIBUTE class {Politice, Sportive, Mondene}

            @DATA
            */

            sw.WriteLine("@RELATION profiles\r\n\r\n@ATTRIBUTE age REAL\r\n@ATTRIBUTE domain {Tehnic, Social}\r\n" +
                "@ATTRIBUTE time REAL\r\n@ATTRIBUTE class {Politice, Sportive, Mondene}\r\n\r\n@DATA");

            /*
            @RELATION profiles-discretized

            @ATTRIBUTE age {VeryLow, Low, Medium, High, VeryHigh}
            @ATTRIBUTE domain {Tehnic, Social}
            @ATTRIBUTE time {Night, Morning, Afternoon, Evening}
            @ATTRIBUTE class {Politice, Sportive, Mondene}

            @DATA
            */

            swDiscr.WriteLine("@RELATION profiles-discretized\r\n\r\n@ATTRIBUTE age {VeryLow, Low, Medium, High, VeryHigh}\r\n" +
                "@ATTRIBUTE domain {Tehnic, Social}\r\n" +
                "@ATTRIBUTE time {Night, Morning, Afternoon, Evening}\r\n" +
                "@ATTRIBUTE class {Politice, Sportive, Mondene}\r\n\r\n@DATA");

            int noInstances = 1000;
            bool hasNoise = false;    // set to true to generate noisy data
            double noiseRatio = 0.1;  // proportion of instances whose class is redrawn

            const int AgeIntervals = 5; const int TimeIntervals = 4;

            for (int i = 0; i < noInstances; i++)
            {
                int varsta = 12 + r.Next(60); // age: 12..71
                int domeniu = r.Next(2);      // field: 0 = Tehnic, 1 = Social
                int timp = r.Next(24);        // hour of the login: 0..23
                int clasa = 1;

                if (domeniu == 1) // Social
                {
                    if (varsta < 40)
                        clasa = 2; // Mondene
                    else
                        clasa = 0; // Politice
                }
                else // Tehnic
                {
                    if (varsta < 30)
                        clasa = 1; // Sportive
                    else
                    {
                        if (timp > 18)
                            clasa = 0; // Politice
                        else
                            clasa = 2; // Mondene
                    }
                }

                // noise: the class is redrawn at random (it may coincide with the correct one)
                if (hasNoise && r.NextDouble() < noiseRatio)
                    clasa = r.Next(3);

                sw.WriteLine(varsta.ToString() + ", " + Domain(domeniu) + ", " + timp + ", "
                    + NewsCategory(clasa));

                swDiscr.WriteLine(Age(Discretizare(varsta, 12, 72, AgeIntervals)) + ", " +
                    Domain(domeniu) + ", " +
                    TimeOfDay(Discretizare(timp, 0, 24, TimeIntervals)) + ", " +
                    NewsCategory(clasa));
            }

            sw.Close();
            swDiscr.Close();
        }


        // name of the field of activity
        private static string Domain(int domeniu)
        {
            switch (domeniu)
            {
                case 0: return "Tehnic";
                case 1: return "Social";
            }
            return string.Empty;
        }


        // name of the news category (the class)
        private static string NewsCategory(int clasa)
        {
            switch (clasa)
            {
                case 0: return "Politice";
                case 1: return "Sportive";
                case 2: return "Mondene";
            }
            return string.Empty;
        }


        // index of the equal-width interval of [min, max) that contains the value
        private static int Discretizare(int valoare, int min, int max, int nrIntervale)
        {
            int x = nrIntervale * (valoare - min) / (max - min);
            if (valoare >= max)
                x = nrIntervale - 1;
            return x;
        }


        private static string Age(int index)
        {
            switch (index)
            {
                case 0: return "VeryLow";
                case 1: return "Low";
                case 2: return "Medium";
                case 3: return "High";
                case 4: return "VeryHigh";
            }
            return string.Empty;
        }


        private static string TimeOfDay(int index)
        {
            switch (index)
            {
                case 0: return "Night";
                case 1: return "Morning";
                case 2: return "Afternoon";
                case 3: return "Evening";
            }
            return string.Empty;
        }
    }
}
The generated files are shown in figures 3.1 and 3.2.


Figure 3.1. Training data file containing real-valued attributes


Figure 3.2. Training data file in which the real-valued attributes are discretized


Application 3.2. The program uses the C4.5 and ID3 algorithms
implemented in Weka to build the decision trees for the files generated
by the previous application. Since the Weka algorithms are implemented in
Java, they can be compiled on the .NET platform in J#, as DLL libraries.
These DLLs can then be referenced in any .NET solution, including C#
projects, from which the functions that build the decision trees can be
called directly.


using System;
using System.Collections.Generic;
using System.Text;


namespace WekaTest
{
    class Program
    {
        static void Main(string[] args)
        {
            // C4.5 (J48 in Weka), trained on the file with real-valued attributes
            weka.classifiers.trees.J48 c45 = new weka.classifiers.trees.J48();
            Console.WriteLine("C45");
            Console.Write(weka.classifiers.Evaluation.evaluateModel(
                c45, new string[] { "-t", "profiles.arff" }));

            // age 30, Tehnic/0, time 20, class unknown
            double[] attvals = new double[] { 30, 0, 20, -1 };
            weka.core.Instance t = new weka.core.Instance(1, attvals);
            // "sample" is assumed to be a member exposed by the J48 build used here
            t.setDataset(c45.sample.dataset());
            Console.WriteLine("\r\nClassified as: " + c45.classifyInstance(t));

            // ID3, trained on the discretized file
            Console.WriteLine("\r\n\r\nID3");
            weka.classifiers.trees.Id3 id3 = new weka.classifiers.trees.Id3();
            Console.Write(weka.classifiers.Evaluation.evaluateModel(
                id3, new string[] { "-t", "profilesDiscretized.arff" }));

            // Medium/2, Tehnic/0, Morning/1 => Mondene/2
            attvals = new double[] { 2, 0, 1, -1 };
            t = new weka.core.Instance(1, attvals);
            Console.WriteLine("\r\nClassified as: " + id3.classifyInstance(t));
        }
    }
}

Since the test application is a console program, the algorithms were run
one at a time for the sake of presentation, and the results are shown in
figures 3.3 and 3.4.

Figure 3.3. The decision tree generated by the C4.5 algorithm


Figure 3.4. The decision tree generated by the ID3 algorithm


Application 3.3. The web application proper, which lets users register
and log in, create profiles, and see the personalized news category
chosen by the decision tree induced from the training set.


3.3.1. Login.aspx.cs - The start page, where the user can create an
account and log in.

The design of the graphical interface is shown in Figure 3.5.


Figure 3.5. Interface design for the login page


using System;
using System.IO;


public partial class _Default : System.Web.UI.Page
{
protected void ButtonNew_Click(object sender, EventArgs e)
{
// create a new account
Response.Redirect("NewAccount.aspx");
}


protected void ButtonLogin_Click(object sender, EventArgs e)
{
// log in

string username = TextBoxUser.Text;
string password = TextBoxPass.Text;
bool exists = false;
UserProfile up = null;

// users.txt holds one account per line: name, password hash, field, age
using (StreamReader sr = new StreamReader(Server.MapPath(".") + "\\users.txt"))
{
    while (sr.Peek() != -1) // read until the end of the file
    {
        string line = sr.ReadLine(); // read one line at a time
        string[] words = line.Split(); // split the line on whitespace
        if (words.Length < 4) continue; // skip blank or malformed lines

        if (words[0] == username && words[1] == Utils.HashPassword(password))
        {
            exists = true;
            up = new UserProfile();
            up.Name = username;
            up.Field = words[2];
            up.Age = Convert.ToInt32(words[3]);

            DateTime now = DateTime.Now;

            if (now.Hour < 6) up.Time = "Noapte";
            else if (now.Hour < 12) up.Time = "Dimineata";
            else if (now.Hour < 18) up.Time = "Zi";
            else up.Time = "Seara";

            up.ExactTime = now.Hour;

            break; // user found, leave the loop
        }
    }
} // the reader is closed here, even if an exception is thrown

if (!exists) // unknown user name or wrong password
{
Response.Redirect("InvalidUser.aspx");
}
else // valid user
{
// redirect to the news page
Session["User"] = up;
Response.Redirect("CustomNews.aspx");
}

}
}


3.3.2. Utils.cs - The classes needed for hashing the password and for
the user profile.


using System;
using System.Security.Cryptography;


public class Utils
{
public static string HashPassword(string pass)
{
    // encode as UTF-8 so that non-ASCII characters are not truncated to one byte
    byte[] buf = System.Text.Encoding.UTF8.GetBytes(pass);
    using (HashAlgorithm sha = new SHA1CryptoServiceProvider())
    {
        byte[] result = sha.ComputeHash(buf);
        return Convert.ToBase64String(result); // Base64 text, safe to store in users.txt
    }
}
}

public class UserProfile
{
public string Name; // from the login form
public string Field; // from the registered profile
public string Time; // computed by the program: time of day at login
public int ExactTime; // hour of the login
public int Age; // from the registered profile
}


3.3.2. InvalidUser.aspx - The error page shown when the user enters a
wrong password or a user name that does not exist.


Figure 3.6. The login error page


3.3.3. NewAccount.aspx.cs - The page for creating a new account and
filling in the user profile. The new account is added to the database.
The user must fill in their personal details. The password is hashed
before it is saved to the database: a hash code is computed for the
password, and only this code is stored. When a user enters the password
on the login page, the hash code is computed again and compared with the
code in the database. In practice, the two codes match only if the
passwords themselves match. In this way, passwords are never stored in
plain text and cannot be read directly from the hash codes.
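
The hash-and-compare scheme described above can be strengthened with a per-user random salt and a deliberately slow key derivation function, so that identical passwords give different stored values and dictionary attacks become more expensive. The following is only a sketch under stated assumptions, not the implementation used in the book: the class name PasswordHasher, the "salt:hash" storage format and the iteration count are hypothetical choices, and the Rfc2898DeriveBytes constructor that takes a HashAlgorithmName requires .NET Framework 4.7.2 or later (or .NET Core 2.0 and later).

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical helper: salted PBKDF2 hashing, stored as "salt:hash" in Base64.
public static class PasswordHasher
{
    private const int SaltSize = 16;       // bytes of random salt (assumed value)
    private const int HashSize = 32;       // bytes of derived key (assumed value)
    private const int Iterations = 100000; // work factor (assumed value)

    public static string Hash(string password)
    {
        byte[] salt = new byte[SaltSize];
        using (RandomNumberGenerator rng = RandomNumberGenerator.Create())
            rng.GetBytes(salt); // a fresh salt for every account

        byte[] hash = Derive(password, salt);
        return Convert.ToBase64String(salt) + ":" + Convert.ToBase64String(hash);
    }

    public static bool Verify(string password, string stored)
    {
        string[] parts = stored.Split(':');
        if (parts.Length != 2) return false;

        byte[] salt = Convert.FromBase64String(parts[0]);
        byte[] expected = Convert.FromBase64String(parts[1]);
        byte[] actual = Derive(password, salt);
        if (actual.Length != expected.Length) return false;

        // compare all bytes without stopping at the first difference
        int diff = 0;
        for (int i = 0; i < actual.Length; i++)
            diff |= actual[i] ^ expected[i];
        return diff == 0;
    }

    private static byte[] Derive(string password, byte[] salt)
    {
        using (Rfc2898DeriveBytes kdf = new Rfc2898DeriveBytes(
            password, salt, Iterations, HashAlgorithmName.SHA256))
        {
            return kdf.GetBytes(HashSize);
        }
    }
}
```

Because Base64 text contains neither whitespace nor the ':' character, the stored value would still fit in one tab-separated column of users.txt. The login page would then call PasswordHasher.Verify(password, words[1]) instead of comparing two hash strings.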


Figure 3.7. Interface design for creating a new account


using System;
using System.IO;


public partial class NewAccount : System.Web.UI.Page
{
protected void ButtonCreate_Click(object sender, EventArgs e)
{
string username = TextBoxUserName.Text;
string password = TextBoxPass.Text;
int field = DropDownListField.SelectedIndex;
int age = 0;

bool ok = true; // whether the submitted data is valid

// int.TryParse returns false if TextBoxAge does not contain a number
if (!int.TryParse(TextBoxAge.Text, out age))
    ok = false;

// the name must not contain whitespace, because the login page splits
// each line of users.txt on whitespace
if (username == "" || password == "" || age < 0 || age > 150
    || username.Split().Length != 1)
    ok = false;

if (ok)
{
// open the file for appending
StreamWriter sw = new StreamWriter(Server.MapPath(".") + "\\users.txt", true);

// a check for an existing user with the same name could be added here

// passwords must NOT be stored in plain text
sw.WriteLine(username + "\t" + Utils.HashPassword(password) + "\t" + field +
"\t" + age);

sw.Close();

Response.Redirect("AccountOK.aspx");
}
}
}


Figure 3.8 shows the database of user accounts. It contains the name,
the hash code of the password, and the personal profile data.



Figure 3.8. The accounts database


3.3.4. AccountOK.aspx - The page confirming that a new account was
created successfully.

Figure 3.9. The confirmation page for a successfully created account


3.3.5. CustomNews.aspx.cs - Dynamically generates a personalized news
page according to the profile of the current user. The news category is
determined by classifying the current profile with the decision tree
built earlier.


using System;
using System.IO;


public partial class CustomNews : System.Web.UI.Page
{
protected void Page_Load(object sender, EventArgs e)
{
// "Page_Load" runs when the page is loaded

// if this page (CustomNews.aspx) is requested directly, redirect to the login page
if (Session["User"] == null)
Response.Redirect("Login.aspx");

if (Application["DT"] == null) // the decision tree is built only once
{
InitializeDecisionTree();
}

UserProfile up = (UserProfile)Session["User"];

StreamReader sr = new StreamReader(Server.MapPath(".") + "\\stiri.htm");
string template = sr.ReadToEnd(); // read the template page
sr.Close();

int noNews = 10;
string[] headlines = null;

string mesaj = "";
switch (up.Time) // choose a greeting according to the login time
{
case "Dimineata": mesaj = "Buna dimineata"; break;
case "Zi": mesaj = "Buna ziua"; break;
case "Seara": mesaj = "Buna seara"; break;
case "Noapte": mesaj = "Cam tarziu azi"; break;
}

template = template.Replace("Mesaj", mesaj); // insert the greeting, e.g. "Buna ziua"
template = template.Replace("Gigi", up.Name); // insert the user name

Random r = new Random();

// the class chosen by the decision tree for this user profile
switch (DecideNewsClass(up))
{
case 0:
template = template.Replace("#FF00FF", "#00FF00"); // change the color
template = template.Replace("gen", "politice"); // insert the news category
// load the headlines
ReadHeadlines(Server.MapPath(".") + "\\politice", ref headlines, noNews);
break;
case 1:
template = template.Replace("#FF00FF", "#FF8080");
template = template.Replace("gen", "sportive");
ReadHeadlines(Server.MapPath(".") + "\\sportive", ref headlines, noNews);
break;
case 2:
template = template.Replace("#FF00FF", "#8080FF");
template = template.Replace("gen", "mondene");
ReadHeadlines(Server.MapPath(".") + "\\mondene", ref headlines, noNews);
break;
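default: // should not happen
template = template.Replace("gen", "");
headlines = new string[noNews];
break;
}

// indices of the headlines that were actually read (skips null or empty entries)
System.Collections.Generic.List<int> available =
    new System.Collections.Generic.List<int>();
for (int i = 0; i < headlines.Length; i++)
    if (!string.IsNullOrEmpty(headlines[i]))
        available.Add(i);

// fill News1..News5 with distinct headlines chosen at random
for (int i = 1; i <= 5; i++)
{
    string headline = "";
    if (available.Count > 0)
    {
        int pos = r.Next(available.Count);
        headline = headlines[available[pos]];
        available.RemoveAt(pos); // never pick the same headline twice
    }
    template = template.Replace("News" + i, headline);
}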

StreamWriter sw = new StreamWriter(Server.MapPath(".") + "\\stiri_pers.htm");
sw.Write(template);
sw.Close();

Response.Redirect("stiri_pers.htm");
}


private void ReadHeadlines(string filename, ref string[] headlines, int noNews)
{
// read at most noNews headlines from the file
headlines = new string[noNews];
StreamReader sr = new StreamReader(filename);
for (int i = 0; i < noNews && sr.Peek() != -1; i++)
headlines[i] = sr.ReadLine();
sr.Close();
}



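private int DecideNewsClass(UserProfile up) // classify the profile with the decision tree
{
// up.Field holds the index of the selected field, as saved by NewAccount.aspx.cs
double dom = Convert.ToDouble(up.Field);
double[] attvals = new double[] { up.Age, dom, up.ExactTime, -1 };
weka.core.Instance t = new weka.core.Instance(1, attvals);
weka.classifiers.trees.J48 decisionTree =
(weka.classifiers.trees.J48)Application["DT"];
t.setDataset(decisionTree.sample.dataset());
// classifyInstance returns the class index as a double
return (int)decisionTree.classifyInstance(t);
}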


private void InitializeDecisionTree()
{
weka.classifiers.trees.J48 decisionTree = new weka.classifiers.trees.J48();
weka.classifiers.Evaluation.evaluateModel(decisionTree,
new string[] { "-t", "profiles.arff" });
Application["DT"] = decisionTree;
}
}

3.3.6. Stiri.htm - The personalized page is generated from a template,
in which the program dynamically fills in details such as the user name,
the background color, and the news items themselves.
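
The template mechanism described above can be sketched as a small helper. This is an illustration under assumptions, not the code of the application: the class name TemplateFiller and the {{name}} placeholder syntax are hypothetical (the original template uses bare words such as Mesaj, Gigi and gen, which can collide with ordinary page text). The sketch also HTML-encodes the inserted values with System.Net.WebUtility.HtmlEncode, since the user name comes from user input.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical helper: fills {{key}} placeholders in an HTML template.
public static class TemplateFiller
{
    public static string Fill(string template, IDictionary<string, string> values)
    {
        string result = template;
        foreach (KeyValuePair<string, string> pair in values)
        {
            // delimited placeholders cannot match ordinary words in the page
            string placeholder = "{{" + pair.Key + "}}";
            // encode the value so that characters like < and > are shown, not interpreted
            result = result.Replace(placeholder, WebUtility.HtmlEncode(pair.Value));
        }
        return result;
    }
}
```

With a template fragment such as `<b>{{greeting}}, {{name}}!</b>`, a dictionary holding "greeting" and "name" would produce the personalized greeting, and placeholders that have no value are left unchanged.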


Figure 3.10. The template of the personalized news page
<html>

<head>
<title>Stiri personalizate</title>
</head>

<body>

<table border="1" cellpadding="5" cellspacing="0" style="border-collapse:
collapse" bordercolor="#C0C0C0" width="100%" id="AutoNumber1">
<tr>
<td width="100%" colspan="2" bgcolor="#FF00FF">
<p align="right"><i><b><font face="Arial" size="4">Mesaj, Gigi!</font></b></i>
<p align="center"><b><font face="Arial" size="5">Stirile dumneavoastra gen
preferate</font></b></td>
</tr>
<tr>
<td width="15%" align="left" valign="top"><font face="Arial">
<a href="s1.htm">Stiri politice</a></font><p><font face="Arial">
<a href="s2.htm">Stiri sportive</a></font></p>
<p><font face="Arial"><a href="s3.htm">Stiri mondene</a></font></td>
<td width="85%" align="left" valign="top"><font face="Arial">News1</font><p>
<font face="Arial">News2</font></p>
<p><font face="Arial">News3</font></p>
<p><font face="Arial">News4</font></p>
<p><font face="Arial">News5</font></td>
</tr>
</table>
</body>
</html>


The news headlines themselves are stored in three text files, one for
each type of news: political, sports, and society.



Figure 3.11. Example of a political news file



Figure 3.11. Example of a sports news file



Figure 3.12. Example of a society news file


The user can also choose news other than the category proposed by the
decision tree. For each type of news there is a corresponding page file.



Figure 3.13. The page with all available political news



Figure 3.14. The page with all available sports news



Figure 3.15. The page with all available society news

The running application is shown in the following figures.


Figure 3.16. Creating a new account


Figure 3.17. The login page


Figure 3.18. The dynamically generated personalized news page

Chapter 4


ASSOCIATION RULE INDUCTION


In this chapter we present a final usage mining application: an online
shopping application with market basket analysis through association
rule induction. Like the decision tree induction algorithms, the
algorithm used for this purpose, Apriori, is laborious to implement, so
we again chose to use the Weka implementation.

Application 4.1. Using the Weka implementation of the Apriori algorithm.

using System;
using System.Collections.Generic;
using System.Text;

namespace WekaTest
{
class Program
{
static void Main(string[] args)
{
weka.associations.Apriori apriori = new weka.associations.Apriori();
int maxRules = 10; double minConfidence = 0.9;
string fileName = "transactions.arff";
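// format the confidence with the invariant culture so the decimal separator is always '.'
string[] options = new string[] { "-t", fileName, "-N", maxRules.ToString(),
"-C", minConfidence.ToString(System.Globalization.CultureInfo.InvariantCulture), "-I" };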
java.io.Reader reader = new java.io.BufferedReader(
new java.io.FileReader(fileName));
apriori.setOptions(options);
apriori.buildAssociations(new weka.core.Instances(reader));
Console.Write(apriori.ToString());
}
}
}





Figure 4.1. The file with the analyzed transactions


The test ARFF file is shown in Figure 4.1. If there are n objects whose
co-occurrence must be analyzed, then to apply the Apriori algorithm from
Weka each object must correspond to one attribute, that is, one column
of the data section. Objects that are not present in a transaction have
the missing value (?) in their columns, while objects that are present
have their single nominal value in their own columns. The results of the
Apriori algorithm are shown in Figure 4.2.
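
The encoding described above (one single-valued nominal attribute per object, with ? for absent objects) can be sketched as follows. This is a minimal illustration under assumptions: the class name ArffWriter and its Build method are hypothetical, the item codes are taken to be plain tokens that need no quoting, and the data values are separated by commas, as in the standard ARFF format.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: encodes market-basket transactions as ARFF text.
public static class ArffWriter
{
    public static string Build(string relation, IList<string> items,
        IList<ICollection<string>> transactions)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append("@RELATION ").Append(relation).Append('\n');

        // one nominal attribute per item, whose only possible value is the item itself
        for (int i = 0; i < items.Count; i++)
            sb.Append("@ATTRIBUTE attr").Append(i + 1)
              .Append(" { ").Append(items[i]).Append(" }\n");

        sb.Append("@DATA\n");
        foreach (ICollection<string> transaction in transactions)
        {
            string[] row = new string[items.Count];
            for (int i = 0; i < items.Count; i++)
                row[i] = transaction.Contains(items[i]) ? items[i] : "?"; // ? = absent
            sb.Append(string.Join(",", row)).Append('\n');
        }
        return sb.ToString();
    }
}
```

For the items a, b, c and the transactions {a, c} and {b}, the data section would contain the rows `a,?,c` and `?,b,?`.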


Figure 4.2. The results of the Apriori algorithm


Application 4.2. The online shopping application, in which each
transaction is saved to a file that can then be analyzed with the
Apriori algorithm.


4.2.1. Default.aspx.cs - The main page of the site, which displays
information about the available products. The program generates the
table dynamically from a user-defined file with product information.
Here too, a template is customized at run time.


using System;
using System.IO;
using System.Collections;


public partial class _Default : System.Web.UI.Page
{
protected void Page_Load(object sender, EventArgs e)
{
string path = Server.MapPath(".") + "\\";

StreamReader tmpl = new StreamReader(path + @"templates\buy_tmpl.htm");
string template = tmpl.ReadToEnd();
tmpl.Close();

StreamReader products = new StreamReader(path + @"templates\products.txt");
string line = products.ReadLine();
int noProducts = Convert.ToInt32(line);
MarketBasketItem[] allProducts = new MarketBasketItem[noProducts];

ArrayList buylist = new ArrayList();
Session["BuyList"] = buylist;

string tableLines = "";
double total = 0;

for (int i = 0; i < noProducts; i++)
{
string line1 = products.ReadLine();
string line2 = products.ReadLine();
string line3 = products.ReadLine();
string line4 = products.ReadLine();
string line5 = products.ReadLine();

allProducts[i] = new MarketBasketItem(
line2, line3, line4, Convert.ToDouble(line5));

/*<tr>
<td width="14%" bgcolor="#EEEEEE"><font face="Verdana">ccc</font></td>
<td width="59%" bgcolor="#EEEEEE"><i><font face="Verdana">ttt</font></i></td>
<td width="10%" bgcolor="#EEEEEE"><font face="Verdana">prs</font></td>
<td width="17%" bgcolor="#EEEEEE"><i><font face="Verdana">
<a href="remove.aspx?code=2">sss</a></font></i></td>
</tr>*/

// alternate the row background: even rows are shaded
string bg = (i % 2 == 0) ? " bgcolor=\"#EEEEEE\"" : "";

// products already in the basket get a "Remove" link and count toward the total
string keyword = "Add";
if (buylist.Contains(allProducts[i].Id))
{
    keyword = "Remove";
    total += allProducts[i].Price;
}

tableLines += "<tr>" +
    "<td width=\"14%\"" + bg + "><font face=\"Verdana\">" + allProducts[i].Category + "</font></td>" +
    "<td width=\"59%\"" + bg + "><font face=\"Verdana\">" + allProducts[i].Title + "</font></td>" +
    "<td width=\"10%\"" + bg + "><font face=\"Verdana\">" + allProducts[i].Price + "</font></td>" +
    "<td width=\"17%\"" + bg + "><font face=\"Verdana\"><a href=\"" + keyword + ".aspx?code=" +
    allProducts[i].Id + "\">" + keyword + "</a></font></td></tr>";
}

products.Close(); // release the products file

Session["Products"] = allProducts;

template = template.Replace("productlines", tableLines);
template = template.Replace("nnn", total.ToString());
StreamWriter newpage = new StreamWriter(path + "buypage.htm");

newpage.Write(template);
newpage.Close();

Server.Transfer("buypage.htm");
}
}

4.2.2. MarketBasketItem.cs - The class that describes a product offered
for sale.

public class MarketBasketItem
{
public string Category;
public string Id;
public string Title;
public double Price;

public MarketBasketItem(string category, string id, string title, double price)
{
this.Category = category;
this.Id = id;
this.Title = title;
this.Price = price;
}
}

4.2.3. buy_tmpl.htm - The template that is filled in to display the
list of products.



Figure 4.3. The template of the product listing page


<html>

<head>
<title>Buy Online</title>
</head>

<body>

<p>&nbsp;</p>

<p><b><font face="Verdana">Featured items on Buy Online</font></b></p>

<p>&nbsp;</p>
<table border="1" cellspacing="1" width="100%" id="AutoNumber1">
<tr>
<td width="14%" bgcolor="#DDDDDD"><b><font
face="Verdana">Category</font></b></td>
<td width="59%" bgcolor="#DDDDDD"><b><font face="Verdana">Item
Title</font></b></td>
<td width="10%" bgcolor="#DDDDDD"><b><font face="Verdana">Price
</font></b></td>
<td width="17%" bgcolor="#DDDDDD">&nbsp;</td>
</tr>
productlines
</table>

<p>&nbsp;</p>
<table border="0" cellspacing="0" width="100%" cellpadding="0"
id="AutoNumber2">
<tr>
<td width="50%">
<p align="left"><i><font face="Verdana"><b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Total:</b> nnn</font></i></td>
<td width="50%">
<p align="right"><i><font face="Verdana"><b><a href="complete.aspx">Check out</a></b></font></i>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</td>
</tr>
</table>
<p align="right">&nbsp;</p>

</body>

</html>


Figure 4.4. The file with information about the available products


4.2.4. Add.aspx.cs - Adds a product to the shopping basket.


using System;
using System.Collections;
using System.IO;


public partial class Add : System.Web.UI.Page
{
protected void Page_Load(object sender, EventArgs e)
{
string code = Request.QueryString["code"];
MarketBasketItem[] allProducts = (MarketBasketItem[])Session["Products"];
ArrayList buylist = (ArrayList)Session["BuyList"];

int noProducts = allProducts.Length;
double total = 0;

for (int i = 0; i < noProducts; i++)
{
if (allProducts[i].Id == code)
{
if (!buylist.Contains(code)) // avoid duplicate entries when the page is reloaded
buylist.Add(code);
//total += allProducts[i].Price;
break;
}
}

string path = Server.MapPath(".") + "\\";
StreamReader tmpl = new StreamReader(path + @"templates\buy_tmpl.htm");
string template = tmpl.ReadToEnd();
tmpl.Close();

string tableLines = "";

for (int i = 0; i < noProducts; i++)
{
// alternate the row background: even rows are shaded
string bg = (i % 2 == 0) ? " bgcolor=\"#EEEEEE\"" : "";

// products already in the basket get a "Remove" link and count toward the total
string keyword = "Add";
if (buylist.Contains(allProducts[i].Id))
{
    keyword = "Remove";
    total += allProducts[i].Price;
}

tableLines += "<tr>" +
    "<td width=\"14%\"" + bg + "><font face=\"Verdana\">" + allProducts[i].Category + "</font></td>" +
    "<td width=\"59%\"" + bg + "><font face=\"Verdana\">" + allProducts[i].Title + "</font></td>" +
    "<td width=\"10%\"" + bg + "><font face=\"Verdana\">" + allProducts[i].Price + "</font></td>" +
    "<td width=\"17%\"" + bg + "><font face=\"Verdana\"><a href=\"" + keyword + ".aspx?code=" +
    allProducts[i].Id + "\">" + keyword + "</a></font></td></tr>";
}

template = template.Replace("productlines", tableLines);
template = template.Replace("nnn", total.ToString());
StreamWriter newpage = new StreamWriter(path + "buypage.htm");

newpage.Write(template);
newpage.Close();

Server.Transfer("buypage.htm");
}
}


4.2.5. Remove.aspx.cs - Removes a product from the shopping basket.


using System;
using System.Collections;
using System.IO;


public partial class Remove : System.Web.UI.Page
{
protected void Page_Load(object sender, EventArgs e)
{
string code = Request.QueryString["code"];
MarketBasketItem[] allProducts = (MarketBasketItem[])Session["Products"];
ArrayList buylist = (ArrayList)Session["BuyList"];

int noProducts = allProducts.Length;
double total = 0;

for (int i = 0; i < noProducts; i++)
{
if (allProducts[i].Id == code)
{
buylist.Remove(code);
break;
}
}

string path = Server.MapPath(".") + "\\";
StreamReader tmpl = new StreamReader(path + @"templates\buy_tmpl.htm");
string template = tmpl.ReadToEnd();
tmpl.Close();

string tableLines = "";

for (int i = 0; i < noProducts; i++)
{
// alternate the row background: even rows are shaded
string bg = (i % 2 == 0) ? " bgcolor=\"#EEEEEE\"" : "";

// products still in the basket get a "Remove" link and count toward the total
string keyword = "Add";
if (buylist.Contains(allProducts[i].Id))
{
    keyword = "Remove";
    total += allProducts[i].Price;
}

tableLines += "<tr>" +
    "<td width=\"14%\"" + bg + "><font face=\"Verdana\">" + allProducts[i].Category + "</font></td>" +
    "<td width=\"59%\"" + bg + "><font face=\"Verdana\">" + allProducts[i].Title + "</font></td>" +
    "<td width=\"10%\"" + bg + "><font face=\"Verdana\">" + allProducts[i].Price + "</font></td>" +
    "<td width=\"17%\"" + bg + "><font face=\"Verdana\"><a href=\"" + keyword + ".aspx?code=" +
    allProducts[i].Id + "\">" + keyword + "</a></font></td></tr>";
}

template = template.Replace("productlines", tableLines);
template = template.Replace("nnn", total.ToString());
StreamWriter newpage = new StreamWriter(path + "buypage.htm");

newpage.Write(template);
newpage.Close();

Server.Transfer("buypage.htm");
}
}


4.2.6. Complete.aspx - After completing a transaction, the user can
start a new transaction or view the results of the analysis.


Figure 4.5. The options after completing a transaction


4.2.7. Admin.aspx - The administration module, for running the Apriori
algorithm and viewing the results of the analysis. It creates an ARFF
file with the completed transactions, applies the Apriori algorithm to
this file, and displays the association rules found.
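
To make the displayed rules easier to interpret, the two standard measures of an association rule A => B can be computed directly: the support is the fraction of transactions that contain both A and B, and the confidence is the fraction of the transactions containing A that also contain B (the listings in this chapter pass a minimum confidence of 0.9 to Apriori through the -C option). The sketch below is an illustration only, not part of the Weka API; the class RuleMetrics and its method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: support and confidence of a rule antecedent => consequent.
public static class RuleMetrics
{
    // Number of transactions that contain every item of the given set.
    public static int Count(IList<HashSet<string>> transactions, IEnumerable<string> itemset)
    {
        int n = 0;
        foreach (HashSet<string> t in transactions)
            if (t.IsSupersetOf(itemset)) n++;
        return n;
    }

    // Fraction of all transactions that contain antecedent and consequent together.
    public static double Support(IList<HashSet<string>> transactions,
        IEnumerable<string> antecedent, IEnumerable<string> consequent)
    {
        if (transactions.Count == 0) return 0.0;
        HashSet<string> both = new HashSet<string>(antecedent);
        both.UnionWith(consequent);
        return (double)Count(transactions, both) / transactions.Count;
    }

    // Fraction of the transactions containing the antecedent that also contain the consequent.
    public static double Confidence(IList<HashSet<string>> transactions,
        IEnumerable<string> antecedent, IEnumerable<string> consequent)
    {
        int withAntecedent = Count(transactions, antecedent);
        if (withAntecedent == 0) return 0.0;
        HashSet<string> both = new HashSet<string>(antecedent);
        both.UnionWith(consequent);
        return (double)Count(transactions, both) / withAntecedent;
    }
}
```

For the four transactions {a, b}, {a, b, c}, {a} and {b, c}, the rule a => b has support 2/4 = 0.5 and confidence 2/3, so it would not pass a minimum confidence of 0.9.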




Figure 4.6. Interface design for viewing the results of the analysis


using System;
using System.IO;


public partial class Admin : System.Web.UI.Page
{

protected void Page_Load(object sender, EventArgs e)
{
MarketBasketItem[] allProducts = (MarketBasketItem[])Session["Products"];
int noProducts = allProducts.Length;

string path = Server.MapPath(".") + "\\";
StreamReader transactions = new StreamReader(path + "Database.txt");
StreamWriter arff = new StreamWriter(path + "transactions.arff");
arff.WriteLine("@RELATION transactions");

for (int i = 0; i < noProducts; i++)
arff.WriteLine("@ATTRIBUTE attr" + (i + 1) + " { " + allProducts[i].Id + " }");

arff.WriteLine("@DATA");
while (transactions.Peek() != -1)
{
string line = transactions.ReadLine();
string[] toks = line.Split();

for (int i = 0; i < noProducts; i++)
{
bool exists = false;
for (int j = 0; j < toks.Length; j++)
if (toks[j] == allProducts[i].Id)
{
exists = true;
break;
}
if (exists)
arff.Write(allProducts[i].Id + " ");
else
arff.Write("? ");
}

arff.WriteLine();
}

transactions.Close();
arff.Close();

weka.associations.Apriori apriori = new weka.associations.Apriori();
int maxRules = 10; double minConfidence = 0.9;
string fileName = path + "transactions.arff";
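// format the confidence with the invariant culture so the decimal separator is always '.'
string[] options = new string[] { "-t", fileName, "-N", maxRules.ToString(),
"-C", minConfidence.ToString(System.Globalization.CultureInfo.InvariantCulture), "-I" };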
java.io.Reader reader = new java.io.BufferedReader(
new java.io.FileReader(fileName));
apriori.setOptions(options);
apriori.buildAssociations(new weka.core.Instances(reader));
string s = apriori.ToString();
s = s.Replace("\n", "\r\n");
TextBox1.Text = s;
}
}

The ARFF file (Figure 4.8) is generated from a text file (Figure 4.7)
that contains all completed transactions. Each product is identified by
a unique code, defined by the user in the products file (Figure 4.4).


Figure 4.7. The text file with the completed transactions


Figure 4.8. The ARFF file with the completed transactions



The running application is shown in the following figures.


Figure 4.9. Adding products to the shopping basket



Figure 4.10. The association rules found by the Apriori algorithm
Part 2

CONTENT MINING



Chapter 5. Neural networks
- Predicting the traffic on a site
- Spam detection

Chapter 6. Bayesian classification
- Spam detection
- Relevance of searches in documents

Chapter 7. k-means clustering
- Visualizing the k-means clustering process
- Document clustering
Chapter 5


NEURAL NETWORKS


In this chapter we describe a series of prediction and classification
applications that use neural networks. The most important problems when
working with multilayer perceptron networks are determining the network
topology (the number of hidden layers and the number of neurons in each
hidden layer) and training the network, that is, finding the connection
weights and the neuron biases so that the network approximates the
desired function. The weights can be determined with implementations of
the back-propagation algorithm, with genetic algorithms, or with a
dedicated software package such as NeuroSolutions
(http://www.neurosolutions.com). In what follows, we assume that the
parameters of the networks used are known.


Application 5.1. Testing a neural network that approximates the binary
XOR function.


5.1.1. Class1.cs - Calls the network with the four binary combinations
of inputs and displays the output value of the network.


using System;


namespace ConsoleApplication1
{
class Class1
{
[STAThread]
static void Main(string[] args)
{
for (double i = 0; i < 2; i += 1)
for (double j = 0; j < 2; j += 1)
{
double net;
Retea.ReteaXor(i, j, out net);
Console.WriteLine(" " + i + "\t" + j + "\t" + net);
}
Console.Read();
}
}
}


5.1.2. ReteaXor.cs - The implementation of a neural network that
approximates the XOR function. It uses the bipolar sigmoid activation
function (the hyperbolic tangent), and the inputs and outputs are scaled
to the interval [-0.9, 0.9] to avoid the saturation region of the
sigmoid function.
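
The two ingredients named above can be checked in isolation. The sketch below is illustrative, and the class and method names are hypothetical: it shows that the bipolar sigmoid written as (1 - e^(-2x)) / (1 + e^(-2x)) is the hyperbolic tangent, and that a linear map with factor 1.8 and offset -0.9 (rounded from the scaling factors in the listing) takes a binary input from [0, 1] to [-0.9, 0.9], with the inverse map applied to the network output.

```csharp
using System;

// Hypothetical helper: activation and scaling used by a tanh network with binary inputs.
public static class Activation
{
    // Bipolar sigmoid; mathematically identical to Math.Tanh(x).
    public static double BipolarSigmoid(double x)
    {
        return (1 - Math.Exp(-2 * x)) / (1 + Math.Exp(-2 * x));
    }

    // Maps an input from [0, 1] to [-0.9, 0.9], away from the saturated ends of tanh.
    public static double ScaleInput(double value)
    {
        return value * 1.8 - 0.9;
    }

    // Inverse map: takes a network output from [-0.9, 0.9] back to [0, 1].
    public static double UnscaleOutput(double y)
    {
        return (y + 0.9) / 1.8;
    }
}
```

Under this scaling, an ideal network output of 0.9 is read back as 1 and an output of -0.9 as 0, which is why the trained network only needs to reach the range where tanh is still far from its asymptotes.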


using System;


public class Retea
{
public static void ReteaXor(double parin1, double parin2, out double parout1)
{

int inputs = 2; int hidden1 = 3; int outputs = 1;

// biases (thresholds) of the hidden and output neurons
double[] bias_hid1 = { 1.67026960849762, 1.51845645904541,
1.99274384975433 };
double[] bias_out = { -0.74697744846344 };

// connection weights
double[] weights_in_hid1 = { -2.60063338279724, -1.76398742198944,
1.59563720226288, 2.53694868087769, -1.27316045761108, 1.25908541679382
};
double[] weights_hid1_out = { 1.52559280395508, 1.52687883377075,
-0.786456942558289 };

// scaling factors for the inputs and outputs
double[] sc_in = { 1.79999995231628, -0.899999976158142,
1.79999995231628, -0.899999976158142 };
double[] sc_out = { 1.79999995231628, -0.899999976158142 };

// scale the inputs
double[] yi = new double[2];
yi[0] = parin1 * sc_in[0] + sc_in[1];
yi[1] = parin2 * sc_in[2] + sc_in[3];

// forward propagation
double[] yhid1 = new double[hidden1];

for (int i = 0; i < hidden1; i++)
{
double netinput = 0;
for (int j = 0; j < inputs; j++)
netinput += yi[j] * weights_in_hid1[i * inputs + j];
double xx = netinput + bias_hid1[i];
yhid1[i] = (1 - Math.Exp(-2 * xx)) / (1 + Math.Exp(-2 * xx));
}

double[] yout = new double[outputs];
for (int i = 0; i < outputs; i++)
{
double netinput = 0;
for (int j = 0; j < hidden1; j++)
netinput += yhid1[j] * weights_hid1_out[i * hidden1 + j];
double xx = netinput + bias_out[i];
yout[i] = (1 - Math.Exp(-2 * xx)) / (1 + Math.Exp(-2 * xx));
}

// scale the output back to the original range
parout1 = (yout[0] - sc_out[1]) / sc_out[0];

}
}
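
The scaling and activation used in ReteaXor.cs can be checked in isolation. The sketch below is only an illustration, not part of the book's listing: the class ScalingSketch and its methods Factors, Scale, Unscale and BipolarSigmoid are hypothetical names. It assumes that the scaling factors come from a linear map of [lo, hi] onto [-0.9, 0.9], which reproduces the listed constants up to single-precision rounding.

```csharp
using System;

public static class ScalingSketch
{
    // Returns { a, b } such that y = x * a + b maps [lo, hi] onto [-0.9, 0.9].
    // (Assumed origin of the sc_in / sc_out constants in the listings.)
    public static double[] Factors(double lo, double hi)
    {
        double a = 1.8 / (hi - lo);
        double b = -0.9 - a * lo;
        return new[] { a, b };
    }

    // Forward scaling, as applied to the network inputs.
    public static double Scale(double x, double[] f)
    {
        return x * f[0] + f[1];
    }

    // Inverse scaling, as applied to the network output.
    public static double Unscale(double y, double[] f)
    {
        return (y - f[1]) / f[0];
    }

    // Bipolar sigmoid written as in the listing; it equals tanh(x).
    public static double BipolarSigmoid(double x)
    {
        return (1 - Math.Exp(-2 * x)) / (1 + Math.Exp(-2 * x));
    }
}
```

For example, ScalingSketch.Factors(0, 1) gives the pair (1.8, -0.9) used for the XOR inputs and output, and ScalingSketch.BipolarSigmoid(x) agrees with Math.Tanh(x), so the activation could equally be written with the library call.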

The program's output is shown in Figure 5.1.


Figure 5.1. Output of the neural network for the XOR function



Application 5.2. Predicting the traffic on a website from previous
records. The attributes considered are the calendar month and whether
or not the day falls on a weekend (Saturday or Sunday).

5.2.1. Default.aspx.cs: Creates an image with the predicted traffic
volume for the site and loads the page TrafficPage.htm to display it.
In turn, it is called from TrafficPage.htm to change the reference month
for the prediction.


using System;
using System.Drawing;


public partial class _Default : System.Web.UI.Page
{
protected void Page_Load(object sender, EventArgs e)
{
int luna = 0;

if (Session["Data"] == null) // first request
{
luna = DateTime.Now.Month; // current month
Session["Data"] = luna;
}
else // called from TrafficPage.htm
{
if (Request.QueryString["dir"] == null)
Response.Redirect("TrafficPage.htm"); // should not happen - only if called directly
string type = Request.QueryString["dir"]; // call: Default.aspx?dir=0

luna = (int)Session["Data"];
if (type == "0")
luna--;
else if (type == "1")
luna++;

if (luna <= 0) luna += 12;
if (luna > 12) luna -= 12;
Session["Data"] = luna;
}

// the new reference month is now set
Bitmap b = Deseneaza(luna);
b.Save(Server.MapPath(".") + "\\image.png", System.Drawing.Imaging.ImageFormat.Png);
Response.Redirect("TrafficPage.htm");
}


private Bitmap Deseneaza(int luna)
{
const int width = 312;
const int height = 200;

Bitmap b = new Bitmap(width, height);
Graphics g = Graphics.FromImage(b);

g.Clear(Color.White);

int lastX = 0, lastY = 0;
bool start = true;

for (int zi = 0; zi < 90; zi++)
{
double luna_reala = luna + zi / 30.0;
if (luna_reala > 12) luna_reala -= 12;
int rest = zi % 30;

bool weekend = false;
if (rest % 7 == 5 || rest % 7 == 6)
weekend = true;

double trafic = 0;

if (weekend)
Retea.Trafic(luna_reala + 1, 1, out trafic);
else
Retea.Trafic(luna_reala + 1, 0, out trafic);

if (start)
{
lastX = (int)(zi * width / 90.0);
lastY = (int)(height - (trafic * height / 100.0));
start = false;
}

int crtX = (int)(zi * width / 90.0);
int crtY = (int)(height - (trafic * height / 100.0));
g.DrawLine(Pens.Blue, lastX, lastY, crtX, crtY);
lastX = crtX; lastY = crtY;
}

string[] numeLuni = {"ianuarie", "februarie", "martie", "aprilie", "mai", "iunie",
"iulie", "august", "septembrie", "octombrie", "noiembrie", "decembrie"};

g.DrawString(numeLuni[luna - 1],
new Font("Arial", 10), Brushes.Black, 10, height - 25);
return b;
}
}


Figure 5.2. The training data file for the neural network


5.2.2. Trafic.cs: Implementation of the neural network that models the
traffic on the website.


using System;


public class Retea
{
public static void Trafic(double parin1, double parin2, out double parout1)
{

int inputs = 2; int hidden1 = 12; int hidden2 = 4; int outputs = 1;

// biases (thresholds)
double[] bias_hid1 = { 0.827310502529144, 0.875930905342102, -
1.18250799179077, 0.36259326338768, 0.00145985779818147,
0.765919148921967, -0.739576458930969, -0.584793865680695, -
0.23956073820591, 0.865360915660858, -0.454799801111221,
0.939562618732452 };
double[] bias_hid2 = { 0.22891254723072, 0.325656056404114,
0.495092004537582, 0.566707372665405 };
double[] bias_out = { 0.0429582744836807 };

// weights
double[] weights_in_hid1 = { 0.204264059662819, -0.534760594367981, -
2.35886168479919, 0.139574393630028, 3.59688496589661, -
0.555204212665558, 0.729041218757629, 0.12687386572361,
1.4487851858139, -1.1591819524765, 0.287839323282242, -
0.742211878299713, 1.74276697635651, 0.145347908139229,
2.7674617767334, -0.664727210998535, 1.46303772926331, -
1.53156304359436, 5.21500730514526, 0.222574606537819, -
0.0501896813511849, 0.934899568557739, -0.0189710091799498, -
0.309031486511231 };
double[] weights_hid1_hid2 = { 0.178517088294029, -0.203106388449669,
0.629495143890381, 0.174335151910782, 0.55248761177063,
0.312326073646545, 0.0114415995776653, 0.77429336309433,
0.237478882074356, 1.45006108283997, -0.243590921163559,
0.249050483107567, 0.534767210483551, -0.289859980344772,
0.983765244483948, -0.0520153492689133, -0.158270135521889,
0.057493656873703, 0.522488474845886, 0.724116146564484, -
0.525241792201996, -1.384685754776, -0.422177046537399,
0.297375977039337, 0.53504353761673, 0.0308731123805046, -
0.296064227819443, 0.181394636631012, -0.400412380695343,
0.178471118211746, 0.141643166542053, 0.0696015805006027, -
0.585455358028412, 0.323086082935333, -0.295780688524246,
0.23926268517971, -0.37297111749649, -0.636192858219147,
0.669481873512268, -0.22807066142559, -0.369423091411591,
0.217889681458473, 0.0658376812934875, 0.160979256033897, -
0.493818521499634, -0.519814372062683, -0.342704594135284,
0.0732858180999756 };
double[] weights_hid2_out = { 0.678239941596985, -0.886533319950104, -
0.346319556236267, -0.534705698490143 };

// scaling factors for the inputs and outputs
double[] sc_in = { 0.163636356592178, -1.063636302948,
1.79999995231628, -0.899999976158142 };
double[] sc_out = { 0.0240000002086163, -1.5 };

// scale the inputs
double[] yi = new double[2];
yi[0] = parin1 * sc_in[0] + sc_in[1];
yi[1] = parin2 * sc_in[2] + sc_in[3];

// forward propagation
double[] yhid1 = new double[hidden1];
for (int i = 0; i < hidden1; i++)
{
double netinput = 0;
for (int j = 0; j < inputs; j++)
netinput += yi[j] * weights_in_hid1[i * inputs + j];
double xx = netinput + bias_hid1[i];
yhid1[i] = (1 - Math.Exp(-2 * xx)) / (1 + Math.Exp(-2 * xx));
}

double[] yhid2 = new double[hidden2];
for (int i = 0; i < hidden2; i++)
{
double netinput = 0;
for (int j = 0; j < hidden1; j++)
netinput += yhid1[j] * weights_hid1_hid2[i * hidden1 + j];
double xx = netinput + bias_hid2[i];
yhid2[i] = (1 - Math.Exp(-2 * xx)) / (1 + Math.Exp(-2 * xx));
}

double[] yout = new double[outputs];
for (int i = 0; i < outputs; i++)
{
double netinput = 0;
for (int j = 0; j < hidden2; j++)
netinput += yhid2[j] * weights_hid2_out[i * hidden2 + j];
double xx = netinput + bias_out[i];
yout[i] = (1 - Math.Exp(-2 * xx)) / (1 + Math.Exp(-2 * xx));
}

// scale the output back to the original range
parout1 = (yout[0] - sc_out[1]) / sc_out[0];

}
}


5.2.3. TrafficPage.htm: The page that displays the predictions. Note the
JavaScript script, which appends the current time to the image URL so that
the browser loads the newly generated image each time instead of showing
the copy already held in its cache.

<html>
<head>
<title>WebSite Traffic</title>
<script>
function update_src()
{
var dt = new Date();
var newsrc = "image.png?" + dt.getTime();
document.getElementById("myimage").src = newsrc;
}
onload = update_src;
</script>
</head>

<body >
<p align="center"><b><font size="6">WebSite Traffic</font></b></p>
<p align="center">
<a href="Default.aspx?dir=0"><img border="0" src="back.jpg" width="16"
height="15"></a>&nbsp;&nbsp;
<img border="1" width="312" height="200" src="image.png"
id="myimage">&nbsp;&nbsp;
<a href="Default.aspx?dir=1"><img border="0" src="forward.jpg" width="16"
height="15"></a></p>
</body>

</html>


Figure 5.3. The prediction application at run time

Application 5.3. A more complex spam detection application, based on
real e-mail messages. A simple client for sending and receiving e-mail
is also described. Using the model learned by the neural network, messages
are classified as spam (unsolicited messages) or ham (useful messages).
For this purpose, the training data are frequency vectors of the actual
words in the messages.


5.3.1. Program.cs: The main program, which instantiates an object of the
Analyser class that produces the classification data.


using System;

namespace SpamAnalysis
{
class Program
{
static void Main(string[] args)
{
Analyser an = new Analyser();
an.Analyse();
}
}
}


5.3.2. Analyser.cs: Builds a model of word frequency vectors from a set
of documents, in order to classify e-mail messages as spam or ham.


using System;
using System.Collections.Generic;
using System.Collections;
using System.Text;
using System.IO;

namespace SpamAnalysis
{
class Analyser
{
public Analyser()
{
}

public void Analyse()
{
const int noDocs = 30;
string[] filenames = new string[noDocs];

for (int i = 0; i < 15; i++)
{
filenames[i] = string.Format("ham{0:D2}", i + 1);
filenames[i + 15] = string.Format("spam{0:D2}", i + 1);
}

Hashtable wordfreq = new Hashtable();
Hashtable wordfreqindocs = new Hashtable();
int[] docLengths = new int[noDocs];

for (int i = 0; i < noDocs; i++)
{
StreamReader sr = new StreamReader("mail\\" + filenames[i]);
string file = sr.ReadToEnd();
sr.Close();
string[] toks = RemoveEmptyStrings(file.Split(
" \t\r\n,.;:\"/<>~_`!@#$%^&*-+=?()[]{}0123456789".ToCharArray()));

docLengths[i] = toks.Length;

for (int j = 0; j < toks.Length; j++)
{
string word = toks[j].ToLower() + "|" + filenames[i];
if (wordfreqindocs.Contains(word))
{
int noOccurences = (int)wordfreqindocs[word];
noOccurences++;
wordfreqindocs.Remove(word);
wordfreqindocs.Add(word, noOccurences);
}
else
wordfreqindocs.Add(word, 1);

word = toks[j].ToLower();
if (wordfreq.Contains(word))
{
int noOccurences = (int)wordfreq[word];
noOccurences++;
wordfreq.Remove(word);
wordfreq.Add(word, noOccurences);
}
else
wordfreq.Add(word, 1);
}
}

// analysis

string[] terms = new string[wordfreq.Count];
double[] termWeights = new double[wordfreq.Count];
int tc = 0;
foreach (string t in wordfreq.Keys)
{
terms[tc] = t;
termWeights[tc] = 0;

// find the weight of term t over all documents, "inverse document frequency":
// idf = log( |D| / |{d in D : t in d}| )
int docCount = 0;
for (int d = 0; d < noDocs; d++)
{
string word_doc = t + "|" + filenames[d];
if (wordfreqindocs.Contains(word_doc))
docCount++; // count every document that contains t (no early exit)
}

double idf = Math.Log((double)noDocs / (double)docCount);

// find the weight of term t in each document, "term frequency":
// tf = n(i) / sum_k(n(k))
for (int d = 0; d < noDocs; d++)
{
string word_doc = t + "|" + filenames[d];
int noOccurences = 0;
if (wordfreqindocs.Contains(word_doc))
noOccurences = (int)wordfreqindocs[word_doc];
double tf = (double)noOccurences / (double)docLengths[d];

double tf_idf = tf * idf;
termWeights[tc] += tf_idf / (double)noDocs;
}

tc++;
}

SortTermWeights(terms, termWeights);

// export the words - used to create the filter
StreamWriter swall = new StreamWriter("words.txt");
for (int i = 0; i < terms.Length; i++)
swall.WriteLine(terms[i] + "\t" + termWeights[i]);
swall.Close();

StreamReader srFilter = new StreamReader("wordfilter.txt");
string filterFile = srFilter.ReadToEnd();
srFilter.Close();
string[] toksComWrd = filterFile.Split(" \t\r\n".ToCharArray(),
StringSplitOptions.RemoveEmptyEntries);
List<string> commonWords = new List<string>();
for (int i = 0; i < toksComWrd.Length; i++)
{
string common = toksComWrd[i].ToLower(); // case-insensitive processing
if (!commonWords.Contains(common)) // skip duplicate words, if any
commonWords.Add(common);
}

int maxAttributeTerms = 20;
int[] chosenWordsIndices = new int[maxAttributeTerms];
int cwiCount = 0;

for (int i = 0; i < terms.Length; i++)
{
if (!commonWords.Contains(terms[i]))
{
chosenWordsIndices[cwiCount++] = i;
if (cwiCount >= maxAttributeTerms)
break;
}
}

// export the words selected as attributes
StreamWriter sw = new StreamWriter("chosen_words.txt");
for (int i = 0; i < maxAttributeTerms; i++)
sw.WriteLine(terms[chosenWordsIndices[i]]);
sw.Close();

// build the training vectors for the neural network
string strVectors = "";

// write the header
for (int j = 0; j < maxAttributeTerms; j++)
strVectors += terms[chosenWordsIndices[j]] + "\t";
strVectors += "IsSpam\r\n";

// write the actual vectors, one per document
for (int d = 0; d < noDocs; d++)
{
for (int j = 0; j < maxAttributeTerms; j++)
{
string word_doc = terms[chosenWordsIndices[j]] + "|" + filenames[d];
int noOccurences = 0;
if (wordfreqindocs.Contains(word_doc))
noOccurences = (int)wordfreqindocs[word_doc];
double tf = (double)noOccurences / (double)docLengths[d];

// the tf values of the selected vectors could also be normalized (sum = 1)
strVectors += string.Format("{0:F3}\t", tf * 100);
}
if (d >= 15)
strVectors += "1\r\n"; // class = is spam
else
strVectors += "0\r\n"; // class = is not spam
}

StreamWriter vectors = new StreamWriter("nn_vectors.txt");
vectors.Write(strVectors);
vectors.Close();

Console.WriteLine("OK");
}


private void SortTermWeights(string[] terms, double[] termWeights)
{
int dist, i, j;
string auxs;
double auxd;
int len = terms.Length;

for (dist = len / 2; dist > 0; dist /= 2)
for (i = dist; i < len; i++)
for (j = i - dist; j >= 0 && termWeights[j] < termWeights[j + dist]; j -= dist)
{
auxs = terms[j]; terms[j] = terms[j + dist]; terms[j + dist] = auxs;
auxd = termWeights[j]; termWeights[j] = termWeights[j + dist];
termWeights[j + dist] = auxd;
}
}


private string[] RemoveEmptyStrings(string[] str)
{
int count = 0;
for (int i = 0; i < str.Length; i++)
if (str[i] != null && str[i] != "")
count++;
if (count == 0)
return new string[0]; // empty array, so callers can still read Length
string[] newstr = new string[count];
count = 0;
for (int i = 0; i < str.Length; i++)
if (str[i] != null && str[i] != "")
newstr[count++] = str[i];
return newstr;
}
}
}
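
The term weighting that Analyser.cs computes follows the tf-idf formulas quoted in its comments. As a rough, self-contained sketch under stated assumptions (the class TfIdfSketch and its methods are hypothetical names, documents are already tokenized and lower-cased, and the natural logarithm is used as in Math.Log), the same quantities can be written as:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TfIdfSketch
{
    // tf(t, d) = occurrences of t in d / number of tokens in d
    public static double Tf(string term, string[] doc)
    {
        if (doc.Length == 0) return 0.0;
        return (double)doc.Count(w => w == term) / doc.Length;
    }

    // idf(t) = ln( |D| / number of documents that contain t )
    public static double Idf(string term, List<string[]> docs)
    {
        int containing = docs.Count(d => d.Contains(term));
        if (containing == 0) return 0.0; // term absent from the collection
        return Math.Log((double)docs.Count / containing);
    }

    // Mean tf-idf of a term over all documents, the value by which
    // the terms are ranked before the attributes are chosen.
    public static double MeanTfIdf(string term, List<string[]> docs)
    {
        double idf = Idf(term, docs);
        return docs.Sum(d => Tf(term, d) * idf) / docs.Count;
    }
}
```

A term that occurs in every document gets idf = ln(1) = 0 and therefore a weight of zero, which is why very common words sink to the bottom of the ranking even before the word filter is applied.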



Figure 5.4. Examples of spam and ham documents used for training



Figure 5.5. The common words ignored during processing: the file wordfilter.txt

Figure 5.6. The file words.txt



Figure 5.7. The file chosen_words.txt



Figure 5.8. The file nn_vectors.txt


Application 5.4. Implementation of an e-mail client that can send
messages and detects the spam among the messages received.


5.4.1. inbox.aspx.cs: Displays the client's inbox (the ham messages).


using System;
using System.IO;


public partial class inbox : System.Web.UI.Page
{
protected void Page_Load(object sender, EventArgs e)
{
string path = Server.MapPath(".");

StreamReader tmpl = new StreamReader(path + @"\mail\inbox_tmpl.htm");
string template = tmpl.ReadToEnd();
tmpl.Close();

string table = "";
StreamReader toc = new StreamReader(path + @"\mail\messages.toc");
while (toc.Peek() != -1)
{
string line = toc.ReadLine();
string[] toks = line.Split('|');

if (toks[3] != "0")
continue;
/* <tr>
<td width="30%">morefrom</td>
<td width="70%"><a href="show.aspx?file=msg01">moresubject</td>
</tr>*/

table += "<tr>\r\n<td width=\"30%\">" + toks[0] + "</td>\r\n";
table += "<td width=\"70%\"><a href=\"show.aspx?file=" + toks[2] + "\">" +
toks[1] + "</a></td>\r\n</tr>";
}

/*for (int i=0; i<100; i++)
{
table += "<tr>\r\n<td width=\"30%\">" + "toks[0]" + "</td>\r\n";
table += "<td width=\"70%\"><a href=\"show.aspx?file=" + "toks[2]" + "\">" +
"toks[1]" + "</td>\r\n</tr>";
}*/

toc.Close();

StreamWriter inbox = new StreamWriter(path + "\\mail.htm");
inbox.Write(template.Replace("moremessages", table));
inbox.Close();

Server.Transfer("mail.htm");

}
}

5.4.2. inbox_tmpl.htm: The customized template used to display the inbox.


<html>

<head><title>Webmail client</title></head>

<body>
<div align="center">
<center>
<table border="2" cellspacing="0" width="800" id="AutoNumber1"
bordercolorlight="#C0C0C0" bordercolordark="#808080" style="border-collapse:
collapse" bordercolor="#111111" cellpadding="0">
<tr>
<td width="100%" colspan="2" height="71">
<p align="center"><b><font size="5" face="Arial">Webmail
client</font></b></td>
</tr>
<tr>
<td width="11%" align="left" valign="top" height="133"><font face="Arial">
<a href="inbox.aspx">Inbox</a></font><p>
<font face="Arial">
<a href="compose.aspx">Compose</a></font></p>
<p><font face="Arial">
<a href="spam.aspx">Spam</a></font></p>
<p>&nbsp;</td>
<td width="89%" align="left" valign="top" height="133" bgcolor="#E9E9E9">
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse:
collapse" bordercolor="#C0C0C0" width="100%" id="AutoNumber2">
<tr>
<td width="30%"><b><font face="Arial">From</font></b></td>
<td width="70%"><b><font face="Arial">Subject</font></b></td>
</tr>
moremessages
</table>
</td>
</tr>
</table>

</center>
</div>

</body>

</html>



Figure 5.9. The message index



5.4.3. compose.aspx.cs: Page used to compose a message and send it
through localhost.



Figure 5.10. Designing the interface of the page for sending an e-mail


using System;
using System.Web.Mail;


public partial class compose : System.Web.UI.Page
{
protected void ButtonSend_Click(object sender, EventArgs e)
{
MailMessage eMail = new MailMessage();
eMail.BodyFormat = MailFormat.Text;

eMail.To = TextBoxTo.Text;
eMail.From = TextBoxFrom.Text;

eMail.Subject = TextBoxSubject.Text;

eMail.Body = TextBoxBody.Text;

SmtpMail.SmtpServer = "127.0.0.1";
eMail.Fields["http://schemas.microsoft.com/cdo/configuration/sendusing"] = 1;
eMail.Fields["http://schemas.microsoft.com/cdo/configuration/smtpserverpickupdirectory"] =
"C:\\Inetpub\\mailroot\\Pickup";

SmtpMail.Send(eMail);
// close
Response.Redirect("inbox.aspx");
}



protected void ButtonCancel_Click(object sender, EventArgs e)
{
Response.Redirect("inbox.aspx");
}
}


5.4.4. receive.cs: The code for retrieving e-mail from a POP3 server is
more complex. The implementation below is based on the source code by
William J. Dean, which in turn builds on Agus Kurniawan's tutorial
"Retrieve Mail From a POP3 Server Using C#", available at:
http://www.codeproject.com/csharp/popapp.asp


using System;
using System.IO;
using System.Net;
using System.Net.Sockets;


namespace WebMail
{
class MailReceiver
{
public void Receive()
{
POP3Client pop3 = new POP3Client();

Console.WriteLine("Connecting to server:");
Console.WriteLine(pop3.Connect("pop.gmail.com"));
Console.WriteLine("Issuing USER");
Console.WriteLine(pop3.User("asp_nt"));
Console.WriteLine("Issuing PASS");
Console.WriteLine(pop3.Pass("poppass"));
Console.WriteLine("Issuing STAT");
Console.WriteLine(pop3.Stat());
Console.WriteLine("Issuing LIST");
Console.WriteLine(pop3.List());
Console.WriteLine("Issuing RETR 1");
Console.WriteLine(pop3.Retr(1));
Console.WriteLine("Issuing QUIT");
Console.WriteLine(pop3.Quit());

Console.ReadKey();
}
}


public class POP3Client
{
public enum ConnectState { Disconnect, Authorization, Transaction, Update };

public string User;
public string Password;
public string PopServer;
public bool Error;
public ConnectState State = ConnectState.Disconnect;

private TcpClient server;
private NetworkStream netStream;
private StreamReader sr;
private string data;
private byte[] bData;
private string crlf = "\r\n";


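public POP3Client()
{
// parameterless constructor, as used in MailReceiver.Receive(); the server,
// user name and password are supplied later through Connect, User and Pass
}


public POP3Client(string popServer, string userName, string password)
{
PopServer = popServer;
User = userName;
Password = password;
}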


public string Connect(string popServer)
{
PopServer = popServer; //put the specified server into the pop property
return Connect(); //call the connect method
}


public string Connect()
{
// create server with port 110
server = new TcpClient(PopServer, 110);

try
{
// initialization
netStream = server.GetStream();
sr = new StreamReader(server.GetStream());

//The pop session is now in the AUTHORIZATION state
State = ConnectState.Authorization;
return sr.ReadLine();
}
catch (InvalidOperationException err)
{
return "Error: " + err.ToString();
}

}


private string Disconnect()
{
string temp = "disconnected successfully.";
if (State != ConnectState.Disconnect)
{

//close connection
netStream.Close();
sr.Close();
State = ConnectState.Disconnect;
}
else
{
temp = "Not Connected.";
}
return temp;
}
private void IssueCommand(string command)
{
data = command + crlf;
bData = System.Text.Encoding.ASCII.GetBytes(data.ToCharArray());
netStream.Write(bData, 0, bData.Length);
}


private string ReadSingleLineResponse()
{
string temp;
try
{
temp = sr.ReadLine();
WasPopError(temp);
return temp;
}
catch (InvalidOperationException err)
{
return "Error in read_single_line_response(): " + err.ToString();
}
}
private string ReadMultiLineResponse()
{
string temp = "";
string szTemp;

try
{
szTemp = sr.ReadLine();
WasPopError(szTemp);
if (!Error)
{
while (szTemp != null && szTemp != ".") // also stop if the connection closes early
{
temp += szTemp + crlf;
szTemp = sr.ReadLine();
}
}
else
{
temp = szTemp;
}
return temp;
}
catch (InvalidOperationException err)
{
return "Error in read_multi_line_response(): " + err.ToString();
}
}


private void WasPopError(string response)
{
//detect if the pop server that issued the response believes that
//an error has occurred.

if (response.StartsWith("-"))
{
//if the first character of the response is "-" then the
//pop server has encountered an error executing the last
//command sent by the client
Error = true;
}
else
{
//success
Error = false;
}
}


public string Dele(int msgNumber)
{
string temp;

if (State != ConnectState.Transaction)
{
//DELE is only valid when the pop session is in the TRANSACTION STATE
temp = "Connection state not = TRANSACTION";
}
else
{
IssueCommand("DELE " + msgNumber.ToString());
temp = ReadSingleLineResponse();
}
return temp;
}


public string List()
{
string temp = "";
if (State != ConnectState.Transaction)
{
//the pop command LIST is only valid in the TRANSACTION state
temp = "Connection state not = TRANSACTION";
}
else
{
IssueCommand("LIST");
temp = ReadMultiLineResponse();
}
return temp;
}


public string List(int msgNumber)
{
string temp = "";


if (State != ConnectState.Transaction)
{
//the pop command LIST is only valid in the TRANSACTION state
temp = "Connection state not = TRANSACTION";
}
else
{
IssueCommand("LIST " + msgNumber.ToString());
//when the message number is supplied, expect a single line response
temp = ReadSingleLineResponse();
}
return temp;
}


public string NoOp()
{
string temp;
if (State != ConnectState.Transaction)
{
//the pop command NOOP is only valid in the TRANSACTION state
temp = "Connection state not = TRANSACTION";
}
else
{
IssueCommand("NOOP");
temp = ReadSingleLineResponse();
}
return temp;
}


public string Pass()
{
string temp;
if (State != ConnectState.Authorization)
{
//the pop command PASS is only valid in the AUTHORIZATION state
temp = "Connection state not = AUTHORIZATION";
}
else
{
if (Password != null)
{
IssueCommand("PASS " + Password);
temp = ReadSingleLineResponse();


if (!Error)
{
//transition to the Transaction state
State = ConnectState.Transaction;
}
}
else
{
temp = "No Password set.";
}
}
return temp;
}


public string Pass(string password)
{
Password = password; //put the supplied password into the appropriate property
return Pass(); //call Pass() with no argument
}


public string Quit()
{
//QUIT is valid in all pop states
string temp;
if (State != ConnectState.Disconnect)
{
IssueCommand("QUIT");
temp = ReadSingleLineResponse();
temp += crlf + Disconnect();
}
else
{
temp = "Not Connected.";
}
return temp;
}


public string Retr(int msg)
{
string temp = "";
if (State != ConnectState.Transaction)
{
//the pop command RETR is only valid in the TRANSACTION state
temp = "Connection state not = TRANSACTION";
}
else
{
// retrieve mail with number mail parameter
IssueCommand("RETR " + msg.ToString());
temp = ReadMultiLineResponse();
}
return temp;
}


public string Rset()
{
string temp;
if (State != ConnectState.Transaction)
{
//the pop command RSET is only valid in the TRANSACTION state
temp = "Connection state not = TRANSACTION";
}
else
{
IssueCommand("RSET");
temp = ReadSingleLineResponse();
}
return temp;
}


public string Stat()
{
string temp;
if (State == ConnectState.Transaction)
{
IssueCommand("STAT");
temp = ReadSingleLineResponse();
return temp;
}
else
{
//the pop command STAT is only valid in the TRANSACTION state
return "Connection state not = TRANSACTION";
}
}


public string User()
{
string temp;
if (State != ConnectState.Authorization)
{
//the pop command USER is only valid in the AUTHORIZATION state
temp = "Connection state not = AUTHORIZATION";
}
else
{
if (User != null)
{
IssueCommand("USER " + User);
temp = ReadSingleLineResponse();
}
else
{ //no user has been specified
temp = "No User specified.";
}
}
return temp;
}


public string User(string userName)
{
User = userName; //put the user name in the appropriate property
return User(); //call User() with no arguments
}
}
}
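
A POP3 multi-line response (as returned by LIST or RETR) ends with a line that contains only a period, and any content line that itself begins with a period is sent with one extra leading period that the client must remove. The sketch below shows that reading loop on its own. It is an illustration under stated assumptions, not the code from the listing: Pop3ResponseSketch and ReadMultiLine are hypothetical names, and a TextReader stands in for the network stream.

```csharp
using System;
using System.IO;
using System.Text;

public static class Pop3ResponseSketch
{
    // Reads lines up to the terminating "." line and returns them joined by CRLF.
    public static string ReadMultiLine(TextReader reader)
    {
        var sb = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line == ".")
                return sb.ToString(); // end of the multi-line response

            // undo byte-stuffing: drop the extra leading period
            if (line.StartsWith(".", StringComparison.Ordinal))
                line = line.Substring(1);

            sb.Append(line).Append("\r\n");
        }

        // ReadLine returned null: the peer closed the connection too early
        throw new EndOfStreamException("response ended before the terminating line");
    }
}
```

Throwing when the stream ends early is one possible design choice; returning the partial text together with an error flag, in the style of the Error field of POP3Client, would work as well.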


5.4.5. show.aspx.cs: Displays a given e-mail.


using System;
using System.IO;


public partial class show : System.Web.UI.Page
{
protected void Page_Load(object sender, EventArgs e)
{
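if (Request.QueryString["file"] == null)
Response.Redirect("inbox.aspx");
// keep only the file name, so the parameter cannot point outside the mail folder
string messageFile = Path.GetFileName(Request.QueryString["file"]);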

string path = Server.MapPath(".");

StreamReader tmpl = new StreamReader(path + @"\mail\index_tmpl.htm");
string template = tmpl.ReadToEnd();
tmpl.Close();

StreamReader actualMail = new StreamReader(path + "\\mail\\" +
messageFile);
string mail = actualMail.ReadToEnd();
actualMail.Close();

StreamWriter page = new StreamWriter(path + "\\mail.htm");
page.Write(template.Replace("maincontents", mail));
page.Close();
Server.Transfer("mail.htm");
}
}


5.4.6. spam.aspx.cs: Works like the inbox page, but displays the list of
spam messages.


using System;
using System.IO;


public partial class spam : System.Web.UI.Page
{
protected void Page_Load(object sender, EventArgs e)
{
string path = Server.MapPath(".");
StreamReader tmpl = new StreamReader(path + @"\mail\inbox_tmpl.htm");
string template = tmpl.ReadToEnd();
tmpl.Close();

string table = "";
StreamReader toc = new StreamReader(path + @"\mail\messages.toc");
while (toc.Peek() != -1)
{
string line = toc.ReadLine();
string[] toks = line.Split('|');
if (toks[3] != "1")
continue;

/* <tr>
<td width="30%">morefrom</td>
<td width="70%"><a href="show.aspx?file=msg01">moresubject</td>
</tr>*/

table += "<tr>\r\n<td width=\"30%\">" + toks[0] + "</td>\r\n";
table += "<td width=\"70%\"><a href=\"show.aspx?file=" + toks[2] + "\">" +
toks[1] + "</a></td>\r\n</tr>";
}

/*for (int i=0; i<100; i++)
{
table += "<tr>\r\n<td width=\"30%\">" + "toks[0]" + "</td>\r\n";
table += "<td width=\"70%\"><a href=\"show.aspx?file=" + "toks[2]" + "\">" +
"toks[1]" + "</td>\r\n</tr>";
}*/
toc.Close();
StreamWriter inbox = new StreamWriter(path + "\\mail.htm");
inbox.Write(template.Replace("moremessages", table));
inbox.Close();

Server.Transfer("mail.htm");
}
}




Figure 5.11. The web client application at run time




Figure 5.12. Displaying an e-mail




Figure 5.13. Displaying the list of spam messages



Chapter 6


BAYESIAN CLASSIFICATION


In this chapter we present an alternative approach to classifying
e-mail messages, based on probabilities, using the naive Bayes
classification algorithm. We also propose a probabilistic application
for finding the documents that contain certain concepts searched for
by users.


Application 6.1. Because naive Bayes classification requires real-valued
data to be discretized, the word frequency vectors are discretized as a
preprocessing step. Zero values are kept apart as a distinct discrete
value, since zeros are quite frequent in word frequency vectors.


using System;
using System.Collections.Generic;
using System.Text;
using System.IO;


namespace Preprocessing
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.Write("Numarul de intervale: ");
            int nrIntervale = Convert.ToInt32(Console.ReadLine());

            int nrCol = 20; // attributes (word frequencies) per vector
            int nrLin = 30; // vectors (messages)

            // The extra last column holds the class: 0 = Ham, otherwise Spam
            double[,] date = new double[nrLin, nrCol + 1];

            StreamReader sr = new StreamReader("vectors.txt");
            string all = sr.ReadToEnd();
            sr.Close();

            string[] toks = all.Split(" \t\r\n".ToCharArray(),
                StringSplitOptions.RemoveEmptyEntries);
            for (int i = 0; i < nrLin; i++)
                for (int j = 0; j < nrCol + 1; j++)
                    date[i, j] = Convert.ToDouble(toks[i * (nrCol + 1) + j]);

            // Minimum and maximum of the nonzero values in each column
            double[] min = new double[nrCol];
            double[] max = new double[nrCol];
            for (int i = 0; i < nrCol; i++)
            {
                min[i] = double.MaxValue;
                max[i] = double.MinValue;

                for (int j = 0; j < nrLin; j++)
                {
                    if (min[i] > date[j, i] && date[j, i] > 0)
                        min[i] = date[j, i];
                    if (max[i] < date[j, i] && date[j, i] > 0)
                        max[i] = date[j, i];
                }
            }

            StreamWriter sw = new StreamWriter("vector_discr.txt");

            for (int i = 0; i < nrLin; i++)
            {
                for (int j = 0; j < nrCol; j++)
                {
                    if (date[i, j] == 0)
                        sw.Write("Null\t"); // zero frequencies get their own label
                    else if (min[j] == max[j])
                        sw.Write("Val1\t"); // the column has a single nonzero value
                    else
                    {
                        // Equal-width intervals over [min, max]; the maximum falls in the last one
                        int interv = (int)((date[i, j] - min[j]) * nrIntervale / (max[j] - min[j]) + 1);
                        if (interv > nrIntervale)
                            interv = nrIntervale;
                        sw.Write("Val" + interv + "\t");
                    }
                }

                if (date[i, nrCol] == 0) // class label
                    sw.WriteLine("Ham");
                else
                    sw.WriteLine("Spam");
            }

            sw.Close();
        }
    }
}



Figure 6.1. Word frequency vectors with real values



Figure 6.2. Word frequency vectors with discretized values
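
The discretization rule used in Application 6.1 can be isolated in a small function, which makes it easier to check by hand. The following is only a sketch: the class name `DiscretizationSketch`, the method `Discretize` and the sample range 1 to 7 are assumptions introduced here for illustration, not part of the original program.

```csharp
using System;

public static class DiscretizationSketch
{
    // Maps one word frequency to a discrete label:
    // zero -> "Null", otherwise an equal-width interval over [min, max],
    // where min and max are taken over the nonzero values of the column.
    public static string Discretize(double value, double min, double max, int intervals)
    {
        if (value == 0)
            return "Null";
        if (min == max)
            return "Val1";

        int bin = (int)((value - min) * intervals / (max - min) + 1);
        if (bin > intervals)   // value == max lands one past the last interval
            bin = intervals;
        return "Val" + bin;
    }

    public static void Main()
    {
        // Hypothetical column whose nonzero frequencies range from 1 to 7
        foreach (double v in new double[] { 0, 1, 4, 7 })
            Console.WriteLine(v + " -> " + Discretize(v, 1, 7, 3));
    }
}
```

With three intervals over the range 1 to 7, the four sample values map to Null, Val1, Val2 and Val3 respectively.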


Application 6.2. Classifying e-mail messages by processing the word
frequency vectors with the naive Bayes method with Laplace correction.
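
For reference, the decision rule behind this application can be written as follows. The notation is introduced here and does not appear in the original text: $N$ is the number of training vectors, $n_c$ the number of vectors in class $c$, $n_{c,j}(x_j)$ the number of vectors in class $c$ whose $j$-th attribute equals $x_j$, $m$ the number of attributes and $k_j$ the number of distinct values attribute $j$ can take.

```latex
\hat{c} = \arg\max_{c \in \{\mathrm{Ham},\,\mathrm{Spam}\}} \; P(c) \prod_{j=1}^{m} P(x_j \mid c),
\qquad
P(c) = \frac{n_c}{N},
\qquad
P(x_j \mid c) \approx \frac{n_{c,j}(x_j) + 1}{n_c + k_j}
```

With the settings used in this chapter, $N = 30$, $m = 20$ and, for three intervals plus the separate Null value, $k_j = 4$. The added 1 keeps a single unseen attribute value from forcing the whole product to zero.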




Figure 6.3. Designing the graphical interface for the naive Bayes
e-mail classification application


using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
using System.IO;


namespace NaiveBayes
{
    public partial class Form1 : Form
    {
        string[,] date;
        int nrCol = 20;
        int nrLin = 30;
        int nrInterv = 3;
        int nrClase = 2; // 2 classes: Ham, Spam
        string[] classLabel = new string[] { "Ham", "Spam" };
        string[] attrVal = new string[] { "Null", "Val1", "Val2", "Val3" };


        public Form1()
        {
            InitializeComponent();
        }

        private void exitToolStripMenuItem_Click(object sender, EventArgs e)
        {
            Close();
        }

        private void classifyToolStripMenuItem_Click(object sender, EventArgs e)
        {
            // Classify every vector of the discretized data set
            ReadData("vector_discr.txt");

            textBox.Clear();

            for (int i = 0; i < nrLin; i++)
            {
                string[] v = new string[nrCol];
                for (int j = 0; j < nrCol; j++)
                    v[j] = date[i, j];

                double[] bayesProb = new double[nrClase]; // final (unnormalized) products

                int cl = ClassifyVector(v, bayesProb);

                textBox.Text += "Vectorul " + (i + 1) + " - " + classLabel[cl] + "\t\t";

                // "pml" = per million
                for (int j = 0; j < nrClase; j++)
                    textBox.Text += string.Format("{0:F6} pml\t", bayesProb[j] * 1e+6);

                textBox.Text += Environment.NewLine;
            }
        }


        private int ClassifyVector(string[] vect, double[] bayesProb)
        {
            double[] classProb = new double[nrClase];

            for (int i = 0; i < nrClase; i++)
            {
                bayesProb[i] = 1;

                // Prior: fraction of the training vectors that belong to class i
                int occ = 0;
                for (int j = 0; j < nrLin; j++)
                {
                    if (date[j, nrCol] == classLabel[i])
                        occ++;
                }
                classProb[i] = (double)occ / (double)nrLin;
                bayesProb[i] *= classProb[i];

                for (int j = 0; j < nrCol; j++)
                {
                    int attrOcc = 0;

                    for (int k = 0; k < nrLin; k++)
                    {
                        if (date[k, nrCol] == classLabel[i] && date[k, j] == vect[j])
                            attrOcc++;
                    }

                    // Laplace correction: add 1 to the count and the number of
                    // possible attribute values to the class size
                    double attrProb = (double)(attrOcc + 1) / (double)(occ + attrVal.Length);

                    bayesProb[i] *= attrProb;
                }
            }

            // argmax

            double probMax = bayesProb[0];
            int classMax = 0;

            for (int i = 1; i < nrClase; i++)
            {
                if (probMax < bayesProb[i])
                {
                    probMax = bayesProb[i];
                    classMax = i;
                }
            }

            return classMax;
        }


        private void ReadData(string filename)
        {
            date = new string[nrLin, nrCol + 1];

            StreamReader sr = new StreamReader(filename);
            string all = sr.ReadToEnd();
            sr.Close();

            string[] toks = all.Split(" \t\r\n".ToCharArray(),
                StringSplitOptions.RemoveEmptyEntries);
            for (int i = 0; i < nrLin; i++)
                for (int j = 0; j < nrCol + 1; j++)
                    date[i, j] = toks[i * (nrCol + 1) + j];
        }
    }
}

Figure 6.4 shows the results of classifying the 30 documents.
Columns 2 and 3 display the products of the probabilities for the Ham and
Spam classes respectively, expressed per million.
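
Products of many probabilities become very small, which is why the values above are scaled by a million. A common alternative, sketched here under assumptions of our own, is to add logarithms instead of multiplying and to normalize the two scores so that they sum to 1. The class `LogSpaceSketch`, the method `Normalize` and the sample scores are hypothetical and are not taken from figure 6.4.

```csharp
using System;

public static class LogSpaceSketch
{
    // Turns per-class log-scores into posterior probabilities that sum to 1.
    // logScores[c] = log P(c) + sum over j of log P(x_j | c)
    public static double[] Normalize(double[] logScores)
    {
        double max = double.NegativeInfinity;
        foreach (double s in logScores)
            if (s > max) max = s;

        // Subtracting the maximum before exponentiating avoids underflow
        double sum = 0;
        double[] post = new double[logScores.Length];
        for (int i = 0; i < post.Length; i++)
        {
            post[i] = Math.Exp(logScores[i] - max);
            sum += post[i];
        }
        for (int i = 0; i < post.Length; i++)
            post[i] /= sum;
        return post;
    }

    public static void Main()
    {
        // Hypothetical raw products for Ham and Spam
        double[] logScores = { Math.Log(3e-9), Math.Log(1e-9) };
        double[] post = Normalize(logScores);
        Console.WriteLine("Ham: {0:F2}, Spam: {1:F2}", post[0], post[1]);
    }
}
```

The predicted class is unchanged, because the logarithm and the normalization both preserve the ordering of the scores; only the presentation differs.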


Figure 6.4. Classification results


Application 6.3. Implementing the Bayesian network algorithm to
determine the marginal probabilities of events, in this case the presence
of certain concepts in the documents in which the user performs a
search.


using System;
using System.Collections.Generic;
using System.Text;


namespace BayesNet
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Asia");
            BayesianNetwork bnet = new BayesianNetwork("visitasia.txt");
            Console.WriteLine(bnet.DisplayResults());

            Console.WriteLine("Document 1\r\n");
            bnet = new BayesianNetwork("documents1.txt");
            Console.WriteLine(bnet.DisplayResults());

            Console.WriteLine("Document 2\r\n");
            bnet = new BayesianNetwork("documents2.txt");
            Console.WriteLine(bnet.DisplayResults());

            Console.WriteLine("Document 3\r\n");
            bnet = new BayesianNetwork("documents3.txt");
            Console.WriteLine(bnet.DisplayResults());
        }
    }
}


Figure 6.5 shows an example input file for the program. It gives the
marginal probabilities of the independent events, and for the conditioned
events it gives the parents and the conditional probabilities. The
visit-to-Asia problem is a classic Bayesian network problem, proposed in
S. L. Lauritzen, D. J. Spiegelhalter, Local computations with
probabilities on graphical structures and their application to expert
systems, Journal of the Royal Statistical Society, Series B, 50(2),
157-194, 1988.
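
To make the computation concrete, the marginal probability of a conditioned event can be obtained by summing over all true/false configurations of its parents. The sketch below is a simplified illustration, not the program of this application: the names `MarginalSketch`, `Marginal`, `parentProbs` and `cpt` are assumptions, the parents are treated as independent of each other, and the table is ordered so that the first entry corresponds to all parents being true, which is the ordering that the BayesianNetwork class in this application appears to use. The numbers are the values usually quoted for the Asia network; whether visitasia.txt contains exactly these values is not shown here.

```csharp
using System;

public static class MarginalSketch
{
    // parentProbs[j] = P(parent j is true)
    // cpt[i] = P(node | configuration i); bit j of i set means parent j is FALSE,
    // so cpt[0] is the entry for "all parents true".
    public static double Marginal(double[] parentProbs, double[] cpt)
    {
        int n = parentProbs.Length;
        if (cpt.Length != (1 << n))
            throw new ArgumentException("cpt must have 2^n entries");

        double total = 0;
        for (int i = 0; i < cpt.Length; i++)
        {
            double term = cpt[i];
            for (int j = 0; j < n; j++)
            {
                bool parentFalse = ((i >> j) & 1) == 1;
                term *= parentFalse ? 1 - parentProbs[j] : parentProbs[j];
            }
            total += term;
        }
        return total;
    }

    public static void Main()
    {
        // P(asia) = 0.01, P(tub | asia) = 0.05, P(tub | not asia) = 0.01
        double pTub = Marginal(new double[] { 0.01 }, new double[] { 0.05, 0.01 });
        Console.WriteLine(pTub); // 0.05 * 0.01 + 0.01 * 0.99, approximately 0.0104
    }
}
```

Treating the parents as independent is exact when they share no common ancestor and only an approximation otherwise.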



Figure 6.5. Example input file


A more complex Bayesian network is shown in figure 6.6. In it, a set
of terms either belong or do not belong to certain documents, and the
terms in turn determine a set of more general concepts, which are the
basis for the user's search.


Figure 6.6. Bayesian network for searching for concepts in documents


The structure in figure 6.6 can be written in the format adopted for
the input files, as shown in figure 6.7. Depending on the extent to which
particular terms determine particular concepts, several situations can be
analysed.
The three situations considered here are described by the files
documents1.txt, documents2.txt and documents3.txt, which are passed to
the module that computes the marginal probabilities.









Figure 6.7. Description files for the document-concept Bayesian networks


The class that implements the computations of the Bayesian network
method is presented next.


using System;
using System.Collections.Generic;
using System.Text;
using System.IO;


namespace BayesNet
{


    public class BayesianNetwork
    {
        // One row per node: the node name, then either its marginal probability
        // (independent event) or the parent names followed by 2^parents
        // conditional probabilities. After Compute(), element [1] of every row
        // holds the marginal probability of the node.
        private string[][] stringArray;
        private int lineCount;


        public BayesianNetwork(string filename)
        {
            StreamReader sr = new StreamReader(filename);
            string allFile = sr.ReadToEnd();
            sr.Close();

            string[] lines = allFile.Split("\r\n".ToCharArray(),
                StringSplitOptions.RemoveEmptyEntries);
            stringArray = new string[lines.Length][];
            for (int i = 0; i < lines.Length; i++)
                stringArray[i] = lines[i].Split(" \t,;".ToCharArray(),
                    StringSplitOptions.RemoveEmptyEntries);
            lineCount = lines.Length;
            Compute();
        }


        private void Compute()
        {
            // Rows are processed in file order, so parents must precede their children
            for (int j = 0; j < lineCount; j++)
                ComputeLine(j);
        }


        private void ComputeLine(int index)
        {
            int parents = 1;
            double rel = 0, val = 0, sval = 1;

            if (!IsComputed(index))
            {
                // Count the parents: the leading tokens that are node names
                while (GetIndex(stringArray[index][parents + 1]) != -1)
                    parents++;

                rel = Math.Pow(2, parents); // number of parent configurations

                for (int i = 1; i <= rel; i++)
                {
                    sval = double.Parse(stringArray[index][parents + i]);
                    for (int j = 1; j <= parents; j++)
                        // Bit j-1 of i-1 is 0 when parent j is true in configuration i
                        if ((((i - 1) >> (j - 1)) & 1) == 0)
                            sval *= double.Parse(stringArray[GetIndex(stringArray[index][j])][1]);
                        else
                            sval *= (1 - double.Parse(
                                stringArray[GetIndex(stringArray[index][j])][1]));
                    val += sval;
                }

                stringArray[index][1] = val.ToString();
            }
        }


        private int GetIndex(string st)
        {
            for (int i = 0; i < lineCount; i++)
                if (st == stringArray[i][0])
                    return i;
            return -1;
        }


        private bool IsComputed(int index)
        {
            // A row is already computed when its second token is not a node name
            if (GetIndex(stringArray[index][1]) == -1)
                return true;

            return false;
        }


        public string DisplayResults()
        {
            string result = "";
            for (int j = 0; j < lineCount; j++)
                result += stringArray[j][0] + " -> " + stringArray[j][1] + "\r\n";
            return result;
        }
    }
}


The results of running the program are shown in figure 6.8.

Figure 6.8. Results of running the Bayesian network program

Chapter 7


K-MEANS CLUSTERING


In this chapter we present another web mining method, partitioning
or clustering, which, unlike classification, groups the objects it
encounters without supervision. We apply the k-means method to the
problem of grouping documents, which can be an intermediate step before
a meaning is assigned to each discovered group.


Application 7.1. A graphical application that dynamically visualizes
the clustering process on a set of points in a two-dimensional
region.


Figure 7.1. Designing the graphical interface for the clustering animation


using System;
using System.Drawing;
using System.Windows.Forms;


namespace kmeans
{
    public partial class Form1 : Form
    {
        public struct Instance
        {
            public int x, y;
            public int cluster;
        }

        Instance[] objects, means;
        int noPoints, noClusters;


        public Form1()
        {
            InitializeComponent();
            comboBoxNoClus.SelectedIndex = 2; // 4 clusters initially

            objects = null;
            means = null;

            SetStyle(ControlStyles.AllPaintingInWmPaint | ControlStyles.DoubleBuffer |
                ControlStyles.UserPaint, true);
        }


        private void buttonGen_Click(object sender, EventArgs e)
        {
            // Read the configuration
            // Generate the initial points

            try
            {
                noPoints = Convert.ToInt32(textBoxNoPoint.Text);
            }
            catch (Exception exc)
            {
                MessageBox.Show(exc.Message);
                return;
            }

            noClusters = comboBoxNoClus.SelectedIndex + 2;
            objects = new Instance[noPoints];
            means = new Instance[noClusters];
            Random r = new Random();

            for (int i = 0; i < noPoints; i++)
            {
                // Rejection sampling: keep only points inside the circle
                // of radius 200 centred at (200, 200)
                bool ok = false;
                while (!ok)
                {
                    objects[i].x = 5 + r.Next(390);
                    objects[i].y = 5 + r.Next(390);
                    int dx = objects[i].x - 200;
                    int dy = objects[i].y - 200;

                    if (dx * dx + dy * dy < 200 * 200)
                        ok = true;
                }
                objects[i].cluster = -1;
            }

            AssignClusters();

            pictureBox.Refresh();
        }


        private void pictureBox_Paint(object sender, PaintEventArgs e)
        {
            if (objects == null || means == null)
                return;

            e.Graphics.Clear(Color.White);
            Color color = Color.Black;

            for (int i = 0; i < noPoints; i++)
            {
                switch (objects[i].cluster)
                {
                    case 0: color = Color.Blue; break;
                    case 1: color = Color.Red; break;
                    case 2: color = Color.DarkGray; break;
                    case 3: color = Color.Yellow; break;
                    case 4: color = Color.Aqua; break;
                    case 5: color = Color.Fuchsia; break;
                    case 6: color = Color.Orange; break;
                    case 7: color = Color.DarkSeaGreen; break;
                    case 8: color = Color.Maroon; break;
                    case 9: color = Color.GhostWhite; break;
                }

                // Dispose each brush after use instead of leaking one per point
                using (SolidBrush brush = new SolidBrush(color))
                    e.Graphics.FillEllipse(brush, objects[i].x - 4, objects[i].y - 4, 8, 8);
            }

            for (int i = 0; i < noClusters; i++)
            {
                means[i] = ComputeMean(i);
                e.Graphics.FillRectangle(Brushes.Black,
                    means[i].x - 4, means[i].y - 4, 8, 8);
            }
        }


        private Instance ComputeMean(int cluster)
        {
            Instance avg = new Instance();
            int n = 0;

            for (int i = 0; i < noPoints; i++)
                if (objects[i].cluster == cluster)
                {
                    avg.x += objects[i].x;
                    avg.y += objects[i].y;
                    n++;
                }

            if (n == 0)
            {
                MessageBox.Show("One null cluster");
                Close();
                return avg; // avoid the integer division by zero below
            }

            avg.x /= n;
            avg.y /= n;

            return avg;
        }


        private double Distance(Instance f1, Instance f2)
        {
            int q = 2; // Euclidean
            //int q = 1; // Manhattan
            //int q = 100; // Maximum (approximated by a large exponent)

            if (q == 1)
                return Math.Abs(f2.x - f1.x) + Math.Abs(f2.y - f1.y);
            else if (q == 2)
                return Math.Sqrt((f2.x - f1.x) * (f2.x - f1.x) + (f2.y - f1.y) * (f2.y - f1.y));
            else
            {
                // Minkowski distance
                double d = Math.Pow(Math.Abs(f2.x - f1.x), q) + Math.Pow(Math.Abs(f2.y - f1.y), q);
                return Math.Pow(d, 1.0 / q); // 1.0 / q: 1 / q would be integer division
            }
        }


        /// <summary>
        /// Returns true if at least one reassignment has been made
        /// </summary>
        /// <returns></returns>
        private bool ReassignClusters()
        {
            for (int i = 0; i < noClusters; i++)
                means[i] = ComputeMean(i);

            bool reass = false;

            for (int i = 0; i < noPoints; i++)
            {
                double minDist = 1e+10; int minDistCluster = -1;
                for (int j = 0; j < noClusters; j++)
                {
                    double dist = Distance(objects[i], means[j]);
                    if (dist < minDist)
                    {
                        minDist = dist;
                        minDistCluster = j;
                    }
                }

                if (objects[i].cluster != minDistCluster)
                {
                    reass = true;
                    objects[i].cluster = minDistCluster;
                }
            }

            return reass;
        }





        private void AssignClusters()
        {
            // The first noClusters points serve as the initial means
            for (int i = 0; i < noClusters; i++)
                means[i] = objects[i];

            for (int i = 0; i < noPoints; i++)
            {
                double minDist = 1e+10; int minDistCluster = -1;
                for (int j = 0; j < noClusters; j++)
                {
                    double dist = Distance(objects[i], means[j]);
                    if (dist < minDist)
                    {
                        minDist = dist;
                        minDistCluster = j;
                    }
                }
                objects[i].cluster = minDistCluster;
            }
        }


        private void buttonCluster_Click(object sender, EventArgs e)
        {
            if (objects == null)
                return; // no points have been generated yet

            Random r = new Random();

            for (int i = 0; i < noPoints; i++)
                objects[i].cluster = r.Next(noClusters);

            int gen = 0;

            AssignClusters();

            while (true)
            {
                Application.DoEvents();
                bool more = ReassignClusters();
                if (gen > 2 && !more)
                    break;
                pictureBox.Refresh();
                gen++;
            }
        }
    }
}
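
The three distance options in the Distance method are special cases of the Minkowski distance of order q. The stand-alone sketch below illustrates this; the class `MinkowskiSketch` and its parameters are assumptions introduced for the example. Note that in C# the exponent has to be written as `1.0 / q`, because `1 / q` with two integer operands is an integer division and yields 0 for any q greater than 1.

```csharp
using System;

public static class MinkowskiSketch
{
    // Minkowski distance of order q between (x1, y1) and (x2, y2)
    public static double Distance(double x1, double y1, double x2, double y2, double q)
    {
        double dx = Math.Abs(x2 - x1);
        double dy = Math.Abs(y2 - y1);
        return Math.Pow(Math.Pow(dx, q) + Math.Pow(dy, q), 1.0 / q);
    }

    public static void Main()
    {
        Console.WriteLine(Distance(0, 0, 3, 4, 1));   // Manhattan: 3 + 4 = 7
        Console.WriteLine(Distance(0, 0, 3, 4, 2));   // Euclidean: 5
        Console.WriteLine(Distance(0, 0, 3, 4, 100)); // close to max(3, 4) = 4
    }
}
```

A large exponent such as 100 only approximates the maximum distance; with very large coordinates the intermediate powers could overflow, in which case taking the maximum of the absolute differences directly is safer.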



Figure 7.2. The initial clusters



Figure 7.3. The clusters found after applying the k-means algorithm
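
Stripped of the drawing code, the procedure animated in this application reduces to two alternating steps: assign every point to the nearest mean, then move every mean to the centroid of its points, until no assignment changes. The console sketch below illustrates this on six hand-picked points. The class `KMeansSketch`, the method `Cluster` and the sample data are assumptions made for the example; like the application, it seeds the means with the first k points.

```csharp
using System;

public static class KMeansSketch
{
    // Runs k-means on 2-D points and returns the cluster index of each point.
    public static int[] Cluster(double[][] pts, int k)
    {
        double[][] means = new double[k][];
        for (int c = 0; c < k; c++)
            means[c] = (double[])pts[c].Clone();

        int[] assign = new int[pts.Length];
        for (int i = 0; i < assign.Length; i++)
            assign[i] = -1;

        bool changed = true;
        while (changed)
        {
            changed = false;

            // Assignment step: attach each point to the nearest mean
            for (int i = 0; i < pts.Length; i++)
            {
                int best = 0;
                double bestDist = double.MaxValue;
                for (int c = 0; c < k; c++)
                {
                    double dx = pts[i][0] - means[c][0], dy = pts[i][1] - means[c][1];
                    double d = dx * dx + dy * dy;
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }

            // Update step: move each mean to the centroid of its points
            for (int c = 0; c < k; c++)
            {
                double sx = 0, sy = 0; int n = 0;
                for (int i = 0; i < pts.Length; i++)
                    if (assign[i] == c) { sx += pts[i][0]; sy += pts[i][1]; n++; }
                if (n > 0) { means[c][0] = sx / n; means[c][1] = sy / n; }
            }
        }
        return assign;
    }

    public static void Main()
    {
        double[][] pts = {
            new double[] { 0, 0 }, new double[] { 10, 10 },
            new double[] { 1, 0 }, new double[] { 0, 1 },
            new double[] { 9, 10 }, new double[] { 10, 9 }
        };
        Console.WriteLine(string.Join(" ", Cluster(pts, 2))); // prints "0 1 0 0 1 1"
    }
}
```

In this sketch an empty cluster simply keeps its previous mean, which is one of several possible policies; the application instead reports the empty cluster and closes.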
Application 7.2. Using the k-means algorithm to cluster a set of
documents, with the cosine distance between documents.
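
As a point of reference, a cosine-based distance between two word frequency vectors can be sketched as follows. The class `CosineSketch` and its two methods are assumptions made for this example: the similarity is the scalar product divided by the product of the norms, and the distance is taken here to be the angle between the vectors.

```csharp
using System;

public static class CosineSketch
{
    // Cosine similarity of two equally long vectors; 0 if either vector is all zeros.
    public static double Similarity(double[] a, double[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0)
            return 0;
        return dot / Math.Sqrt(normA * normB);
    }

    // Angular distance in radians: 0 for parallel vectors, pi/2 for orthogonal ones.
    public static double AngularDistance(double[] a, double[] b)
    {
        double cos = Similarity(a, b);
        // Clamp to [-1, 1] so rounding errors cannot push Acos out of its domain
        cos = Math.Max(-1.0, Math.Min(1.0, cos));
        return Math.Acos(cos);
    }

    public static void Main()
    {
        double[] d1 = { 2, 0, 1 };
        double[] d2 = { 4, 0, 2 };   // same direction as d1
        double[] d3 = { 0, 3, 0 };   // shares no terms with d1
        Console.WriteLine(AngularDistance(d1, d2)); // 0
        Console.WriteLine(AngularDistance(d1, d3)); // pi / 2
    }
}
```

Because only the direction of a vector matters, a long document and a short one with the same proportions of words are at distance 0 from each other.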



Figure 7.4. Designing the graphical interface for the document clustering application


using System;
using System.Windows.Forms;
using System.IO;


namespace DocClustering
{
    public partial class Form1 : Form
    {
        int nrCol, nrLin;
        double[,] date;

        Instance[] objects, means;
        int noPoints, noClusters;


        public Form1()
        {
            InitializeComponent();
        }


        private void kMeansToolStripMenuItem_Click(object sender, EventArgs e)
        {
            // Read the file of document vectors

            nrCol = 20;
            nrLin = 30;
            date = new double[nrLin, nrCol];

            StreamReader sr = new StreamReader("doc_vectors.txt");
            string all = sr.ReadToEnd();
            sr.Close();

            string[] toks = all.Split(" \t\r\n".ToCharArray(),
                StringSplitOptions.RemoveEmptyEntries);
            for (int i = 0; i < nrLin; i++)
                for (int j = 0; j < nrCol; j++)
                    date[i, j] = Convert.ToDouble(toks[i * nrCol + j]);

            objects = new Instance[nrLin];
            for (int i = 0; i < nrLin; i++)
            {
                objects[i] = new Instance(nrCol);

                for (int j = 0; j < nrCol; j++)
                    objects[i].val[j] = date[i, j];
            }

            int[] clusters;
            textBox.Clear();

            double maxSC = double.MinValue;
            int maxSCIndex = -1; // number of clusters that gave the best coefficient

            // 5 runs for each number of clusters between 2 and 10
            for (int nrcl = 2; nrcl <= 10; nrcl++)
            {
                for (int k = 0; k < 5; k++)
                {
                    KMeans(nrcl, out clusters);

                    textBox.Text += nrcl.ToString() + " clustere\r\n";
                    for (int i = 0; i < nrLin; i++)
                        textBox.Text += "Doc" + (i + 1) + ": " + clusters[i] + " ";

                    textBox.Text += "\r\n\r\n";

                    double sc = SilhouetteCoefficient(clusters);
                    textBox.Text += "Silhouette coefficient: " + sc.ToString("F3") + "\r\n\r\n";

                    Application.DoEvents();

                    if (maxSC < sc)
                    {
                        maxSC = sc;
                        maxSCIndex = nrcl;
                    }
                }
            }
        }


        private void KMeans(int noClusters, out int[] assign)
        {
            assign = new int[nrLin];

            noPoints = nrLin;
            this.noClusters = noClusters;

            means = new Instance[noClusters];

            for (int i = 0; i < noClusters; i++)
                means[i] = new Instance(nrCol);

            AssignClusters();

            bool more = true;
            while (more)
            {
                Application.DoEvents();
                more = ReassignClusters();
            }

            for (int i = 0; i < nrLin; i++)
            {
                assign[i] = objects[i].cluster;
            }
        }


        // Shared generator: creating a new Random for every run can repeat the
        // time-based seed, and with it the initial means, when runs follow
        // each other quickly
        private Random r = new Random();


        private void AssignClusters()
        {
            bool[] selected = new bool[noPoints];
            for (int i = 0; i < noPoints; i++)
                selected[i] = false;

            // Pick noClusters distinct objects at random as the initial means
            for (int i = 0; i < noClusters; i++)
            {
                bool ok = false;
                while (!ok)
                {
                    int crtSel = r.Next(noPoints);
                    if (!selected[crtSel])
                    {
                        selected[crtSel] = true;
                        ok = true;
                    }
                }
            }

            int clusterIndex = 0;

            for (int i = 0; i < noPoints; i++)
            {
                if (selected[i])
                {
                    means[clusterIndex] = objects[i];
                    means[clusterIndex].cluster = clusterIndex;
                    clusterIndex++;
                }
            }

            // Assign every object to the nearest initial mean
            for (int i = 0; i < noPoints; i++)
            {
                double minDist = 1e+10; int minDistCluster = -1;
                for (int j = 0; j < noClusters; j++)
                {
                    double dist = Distance(objects[i], means[j]);
                    if (dist < minDist)
                    {
                        minDist = dist;
                        minDistCluster = j;
                    }
                }

                objects[i].cluster = minDistCluster;
            }
        }


        private bool ReassignClusters()
        {
            // Returns true if at least one assignment has changed

            for (int i = 0; i < noClusters; i++)
                means[i] = ComputeMean(i);

            bool reass = false;

            for (int i = 0; i < noPoints; i++)
            {
                double minDist = 1e+10; int minDistCluster = -1;
                for (int j = 0; j < noClusters; j++)
                {
                    double dist = Distance(objects[i], means[j]);
                    if (dist < minDist)
                    {
                        minDist = dist;
                        minDistCluster = j;
                    }
                }

                if (objects[i].cluster != minDistCluster)
                {
                    reass = true;
                    objects[i].cluster = minDistCluster;
                }
            }

            return reass;
        }


        private Instance ComputeMean(int cluster)
        {
            Instance avg = new Instance(nrCol);
            int n = 0;

            for (int i = 0; i < noPoints; i++)
                if (objects[i].cluster == cluster)
                {
                    for (int j = 0; j < avg.val.Length; j++)
                        avg.val[j] += objects[i].val[j];
                    n++;
                }

            if (n == 0)
            {
                MessageBox.Show("One null cluster");
                Close();
                return avg; // avoid dividing by zero below
            }

            for (int j = 0; j < avg.val.Length; j++)
                avg.val[j] /= n;

            return avg;
        }


        private double Distance(Instance f1, Instance f2)
        {
            double dotProduct = 0; // scalar product
            for (int i = 0; i < nrCol; i++)
                dotProduct += f1.val[i] * f2.val[i];

            double modA = 0, modB = 0; // squared norms of the two vectors

            for (int i = 0; i < nrCol; i++)
            {
                modA += f1.val[i] * f1.val[i];
                modB += f2.val[i] * f2.val[i];
            }

            // A zero vector has no direction; treat it as maximally distant
            if (modA == 0 || modB == 0)
                return Math.PI / 2;

            // Cosine similarity, clamped so that rounding cannot push it above 1
            double cs = Math.Min(1.0, Math.Abs(dotProduct) / Math.Sqrt(modA * modB));

            // The distance is the angle between the two vectors, in radians
            return Math.Acos(cs);
        }


        /// <summary>
        /// The higher the value (at most 1), the better the clustering
        /// </summary>
        /// <param name="assignments"></param>
        /// <returns></returns>
        public double SilhouetteCoefficient(int[] assignments)
        {
            double[] c = new double[noPoints];

            for (int i = 0; i < noPoints; i++)
            {
                double a = 0, b = 0;
                int n1 = 0, n2 = 0;
                double sum1 = 0, sum2 = 0;

                for (int j = 0; j < noPoints; j++)
                {
                    if (i == j)
                        continue;
                    if (objects[i].cluster == objects[j].cluster)
                    {
                        sum1 += Distance(objects[i], objects[j]);
                        n1++;
                    }
                    else // different cluster
                    {
                        sum2 += Distance(objects[i], objects[j]);
                        n2++;
                    }
                }

                // a: mean distance to the objects in the same cluster
                // b: mean distance to the objects in the other clusters
                if (n1 > 0)
                    a = sum1 / n1;
                if (n2 > 0)
                    b = sum2 / n2;

                if (Math.Max(a, b) == 0)
                    c[i] = 1;
                else
                    c[i] = (b - a) / Math.Max(a, b);
            }

            double sum_all = 0;
            for (int i = 0; i < noPoints; i++)
                sum_all += c[i];

            return sum_all / noPoints;
        }


        private void exitToolStripMenuItem_Click(object sender, EventArgs e)
        {
            Close();
        }
    }


    public class Instance
    {
        public double[] val;
        public int cluster;


        public Instance(int size)
        {
            val = new double[size];
            cluster = -1;
        }
    }
}



Figure 7.5. Word frequency vectors of the documents to be clustered


Because the initial means are chosen at random, the results of the
algorithm are not deterministic; for the same objects the algorithm can
reach different results. The silhouette coefficient was used as a measure
of clustering quality. It is based on the mean distance between objects
belonging to the same cluster and the mean distance between objects in
different clusters. The higher this coefficient, the more natural, or
better, the clustering: the distance within clusters is minimized and
the distance between clusters is maximized.
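
In symbols, the coefficient described above can be written as follows. The notation is introduced here: $a_i$ is the mean distance from object $i$ to the other objects in its own cluster, $b_i$ is the mean distance from object $i$ to the objects in all other clusters, and $n$ is the number of objects.

```latex
s_i = \frac{b_i - a_i}{\max(a_i,\, b_i)}, \qquad
S = \frac{1}{n}\sum_{i=1}^{n} s_i, \qquad -1 \le s_i \le 1
```

This is a simplified variant. The silhouette coefficient as usually defined takes $b_i$ to be the smallest mean distance from object $i$ to the objects of any single other cluster, so values obtained with the two definitions are not directly comparable.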
Figure 7.6 shows the results of running the program in the graphical
interface. The complete clustering results follow, for a number of
clusters between 2 and 10, with 5 distinct runs for each and the
silhouette coefficient computed for every clustering.



Figure 7.6. Results of running the program


2 clustere
Doc1: 0 Doc2: 0 Doc3: 0 Doc4: 0 Doc5: 0 Doc6: 0 Doc7: 0 Doc8: 0 Doc9: 0 Doc10: 0 Doc11: 0
Doc12: 0 Doc13: 0 Doc14: 0 Doc15: 0 Doc16: 1 Doc17: 0 Doc18: 0 Doc19: 1 Doc20: 1 Doc21: 0
Doc22: 1 Doc23: 1 Doc24: 0 Doc25: 0 Doc26: 0 Doc27: 0 Doc28: 0 Doc29: 0 Doc30: 1

Silhouette coefficient: 0.229

2 clustere
Doc1: 0 Doc2: 0 Doc3: 0 Doc4: 0 Doc5: 0 Doc6: 0 Doc7: 0 Doc8: 0 Doc9: 0 Doc10: 0 Doc11: 0
Doc12: 0 Doc13: 0 Doc14: 0 Doc15: 0 Doc16: 1 Doc17: 1 Doc18: 1 Doc19: 1 Doc20: 1 Doc21: 1
Doc22: 1 Doc23: 1 Doc24: 1 Doc25: 1 Doc26: 0 Doc27: 1 Doc28: 1 Doc29: 0 Doc30: 1

Silhouette coefficient: 0.253

2 clustere
Doc1: 1 Doc2: 0 Doc3: 0 Doc4: 0 Doc5: 0 Doc6: 0 Doc7: 0 Doc8: 0 Doc9: 0 Doc10: 0 Doc11: 1
Doc12: 0 Doc13: 0 Doc14: 0 Doc15: 0 Doc16: 1 Doc17: 1 Doc18: 1 Doc19: 1 Doc20: 1 Doc21: 1
Doc22: 1 Doc23: 0 Doc24: 1 Doc25: 1 Doc26: 0 Doc27: 1 Doc28: 1 Doc29: 1 Doc30: 1

Silhouette coefficient: 0.226

2 clustere
Doc1: 0 Doc2: 1 Doc3: 1 Doc4: 0 Doc5: 0 Doc6: 0 Doc7: 0 Doc8: 0 Doc9: 0 Doc10: 0 Doc11: 0
Doc12: 0 Doc13: 0 Doc14: 0 Doc15: 0 Doc16: 0 Doc17: 0 Doc18: 0 Doc19: 0 Doc20: 0 Doc21: 0
Doc22: 0 Doc23: 1 Doc24: 0 Doc25: 0 Doc26: 0 Doc27: 0 Doc28: 0 Doc29: 0 Doc30: 0

Silhouette coefficient: 0.208

2 clustere
Doc1: 1 Doc2: 0 Doc3: 0 Doc4: 1 Doc5: 1 Doc6: 1 Doc7: 1 Doc8: 1 Doc9: 1 Doc10: 1 Doc11: 1
Doc12: 1 Doc13: 1 Doc14: 1 Doc15: 1 Doc16: 1 Doc17: 1 Doc18: 1 Doc19: 1 Doc20: 1 Doc21: 1
Doc22: 1 Doc23: 1 Doc24: 1 Doc25: 1 Doc26: 1 Doc27: 1 Doc28: 1 Doc29: 1 Doc30: 1

Silhouette coefficient: 0.184

3 clustere
Doc1: 0 Doc2: 1 Doc3: 1 Doc4: 1 Doc5: 1 Doc6: 1 Doc7: 1 Doc8: 1 Doc9: 1 Doc10: 1 Doc11: 0
Doc12: 1 Doc13: 1 Doc14: 1 Doc15: 1 Doc16: 2 Doc17: 0 Doc18: 0 Doc19: 2 Doc20: 2 Doc21: 0
Doc22: 2 Doc23: 2 Doc24: 2 Doc25: 0 Doc26: 1 Doc27: 0 Doc28: 2 Doc29: 0 Doc30: 2

Silhouette coefficient: 0.289

3 clustere
Doc1: 2 Doc2: 0 Doc3: 0 Doc4: 2 Doc5: 2 Doc6: 2 Doc7: 2 Doc8: 2 Doc9: 2 Doc10: 2 Doc11: 2
Doc12: 2 Doc13: 2 Doc14: 2 Doc15: 2 Doc16: 1 Doc17: 1 Doc18: 1 Doc19: 1 Doc20: 1 Doc21: 1
Doc22: 1 Doc23: 1 Doc24: 1 Doc25: 1 Doc26: 2 Doc27: 1 Doc28: 1 Doc29: 1 Doc30: 1

Silhouette coefficient: 0.274

3 clustere
Doc1: 0 Doc2: 0 Doc3: 0 Doc4: 0 Doc5: 0 Doc6: 0 Doc7: 0 Doc8: 0 Doc9: 0 Doc10: 0 Doc11: 1
Doc12: 0 Doc13: 0 Doc14: 0 Doc15: 0 Doc16: 1 Doc17: 1 Doc18: 1 Doc19: 1 Doc20: 1 Doc21: 1
Doc22: 1 Doc23: 2 Doc24: 1 Doc25: 1 Doc26: 0 Doc27: 1 Doc28: 1 Doc29: 1 Doc30: 1

Silhouette coefficient: 0.289

3 clustere
Doc1: 0 Doc2: 0 Doc3: 0 Doc4: 0 Doc5: 0 Doc6: 0 Doc7: 0 Doc8: 0 Doc9: 0 Doc10: 0 Doc11: 1
Doc12: 0 Doc13: 0 Doc14: 0 Doc15: 0 Doc16: 1 Doc17: 1 Doc18: 1 Doc19: 1 Doc20: 1 Doc21: 1
Doc22: 1 Doc23: 2 Doc24: 1 Doc25: 1 Doc26: 0 Doc27: 1 Doc28: 1 Doc29: 1 Doc30: 1

Silhouette coefficient: 0.289

3 clustere
Doc1: 0 Doc2: 0 Doc3: 0 Doc4: 0 Doc5: 0 Doc6: 0 Doc7: 0 Doc8: 0 Doc9: 0 Doc10: 0 Doc11: 2
Doc12: 0 Doc13: 0 Doc14: 0 Doc15: 0 Doc16: 2 Doc17: 2 Doc18: 2 Doc19: 2 Doc20: 2 Doc21: 2
Doc22: 2 Doc23: 1 Doc24: 2 Doc25: 2 Doc26: 0 Doc27: 2 Doc28: 2 Doc29: 2 Doc30: 2

Silhouette coefficient: 0.289

4 clustere
Doc1: 0 Doc2: 1 Doc3: 1 Doc4: 0 Doc5: 0 Doc6: 0 Doc7: 0 Doc8: 0 Doc9: 0 Doc10: 0 Doc11: 0
Doc12: 0 Doc13: 0 Doc14: 0 Doc15: 0 Doc16: 2 Doc17: 2 Doc18: 2 Doc19: 2 Doc20: 2 Doc21: 2
Doc22: 2 Doc23: 3 Doc24: 2 Doc25: 2 Doc26: 0 Doc27: 2 Doc28: 2 Doc29: 2 Doc30: 2

Silhouette coefficient: 0.320

4 clustere
Doc1: 0 Doc2: 1 Doc3: 3 Doc4: 3 Doc5: 2 Doc6: 2 Doc7: 3 Doc8: 3 Doc9: 3 Doc10: 3 Doc11: 0
Doc12: 2 Doc13: 2 Doc14: 3 Doc15: 3 Doc16: 0 Doc17: 0 Doc18: 0 Doc19: 0 Doc20: 0 Doc21: 0
Doc22: 0 Doc23: 2 Doc24: 0 Doc25: 0 Doc26: 3 Doc27: 0 Doc28: 0 Doc29: 0 Doc30: 0

Silhouette coefficient: 0.288

4 clustere
Doc1: 1 Doc2: 3 Doc3: 3 Doc4: 1 Doc5: 2 Doc6: 2 Doc7: 1 Doc8: 1 Doc9: 1 Doc10: 1 Doc11: 2
Doc12: 2 Doc13: 2 Doc14: 1 Doc15: 1 Doc16: 2 Doc17: 0 Doc18: 0 Doc19: 2 Doc20: 2 Doc21: 0
Doc22: 2 Doc23: 3 Doc24: 2 Doc25: 0 Doc26: 1 Doc27: 0 Doc28: 2 Doc29: 0 Doc30: 2

Silhouette coefficient: 0.236

4 clustere
Doc1: 1 Doc2: 0 Doc3: 1 Doc4: 1 Doc5: 1 Doc6: 1 Doc7: 1 Doc8: 1 Doc9: 1 Doc10: 1 Doc11: 1
Doc12: 1 Doc13: 1 Doc14: 1 Doc15: 1 Doc16: 2 Doc17: 0 Doc18: 0 Doc19: 2 Doc20: 2 Doc21: 2
Doc22: 2 Doc23: 3 Doc24: 2 Doc25: 0 Doc26: 1 Doc27: 2 Doc28: 2 Doc29: 0 Doc30: 2

Silhouette coefficient: 0.338

4 clustere
Doc1: 0 Doc2: 1 Doc3: 1 Doc4: 1 Doc5: 1 Doc6: 1 Doc7: 1 Doc8: 1 Doc9: 1 Doc10: 1 Doc11: 0
Doc12: 1 Doc13: 1 Doc14: 1 Doc15: 1 Doc16: 2 Doc17: 0 Doc18: 0 Doc19: 2 Doc20: 2 Doc21: 0
Doc22: 2 Doc23: 3 Doc24: 0 Doc25: 0 Doc26: 1 Doc27: 0 Doc28: 0 Doc29: 0 Doc30: 2

Silhouette coefficient: 0.328

5 clustere
Doc1: 1 Doc2: 0 Doc3: 4 Doc4: 1 Doc5: 1 Doc6: 1 Doc7: 1 Doc8: 1 Doc9: 1 Doc10: 1 Doc11: 1
Doc12: 1 Doc13: 1 Doc14: 1 Doc15: 1 Doc16: 3 Doc17: 3 Doc18: 3 Doc19: 3 Doc20: 3 Doc21: 3
Doc22: 3 Doc23: 2 Doc24: 3 Doc25: 3 Doc26: 1 Doc27: 3 Doc28: 3 Doc29: 3 Doc30: 3

Silhouette coefficient: 0.390

5 clustere
Doc1: 2 Doc2: 0 Doc3: 1 Doc4: 2 Doc5: 2 Doc6: 2 Doc7: 2 Doc8: 2 Doc9: 2 Doc10: 2 Doc11: 2
Doc12: 2 Doc13: 2 Doc14: 2 Doc15: 2 Doc16: 3 Doc17: 3 Doc18: 3 Doc19: 3 Doc20: 0 Doc21: 3
Doc22: 0 Doc23: 4 Doc24: 3 Doc25: 3 Doc26: 2 Doc27: 3 Doc28: 3 Doc29: 3 Doc30: 3

Silhouette coefficient: 0.371

5 clustere
Doc1: 0 Doc2: 0 Doc3: 0 Doc4: 0 Doc5: 0 Doc6: 0 Doc7: 0 Doc8: 0 Doc9: 0 Doc10: 0 Doc11: 0
Doc12: 4 Doc13: 4 Doc14: 0 Doc15: 0 Doc16: 1 Doc17: 2 Doc18: 2 Doc19: 1 Doc20: 1 Doc21: 2
Doc22: 1 Doc23: 3 Doc24: 4 Doc25: 2 Doc26: 0 Doc27: 1 Doc28: 4 Doc29: 2 Doc30: 1

Silhouette coefficient: 0.321

5 clustere
Doc1: 3 Doc2: 0 Doc3: 1 Doc4: 1 Doc5: 1 Doc6: 1 Doc7: 1 Doc8: 1 Doc9: 1 Doc10: 1 Doc11: 3
Doc12: 1 Doc13: 1 Doc14: 1 Doc15: 1 Doc16: 4 Doc17: 3 Doc18: 3 Doc19: 4 Doc20: 4 Doc21: 3
Doc22: 4 Doc23: 2 Doc24: 4 Doc25: 3 Doc26: 1 Doc27: 3 Doc28: 4 Doc29: 3 Doc30: 4

Silhouette coefficient: 0.387

5 clustere
Doc1: 1 Doc2: 0 Doc3: 0 Doc4: 1 Doc5: 1 Doc6: 1 Doc7: 1 Doc8: 1 Doc9: 1 Doc10: 1 Doc11: 1
Doc12: 1 Doc13: 1 Doc14: 1 Doc15: 1 Doc16: 3 Doc17: 4 Doc18: 4 Doc19: 3 Doc20: 3 Doc21: 4
Doc22: 3 Doc23: 2 Doc24: 3 Doc25: 4 Doc26: 1 Doc27: 4 Doc28: 3 Doc29: 4 Doc30: 3

Silhouette coefficient: 0.373

6 clustere
Doc1: 1 Doc2: 0 Doc3: 0 Doc4: 1 Doc5: 1 Doc6: 1 Doc7: 1 Doc8: 1 Doc9: 1 Doc10: 1 Doc11: 1
Doc12: 1 Doc13: 1 Doc14: 1 Doc15: 1 Doc16: 2 Doc17: 3 Doc18: 5 Doc19: 2 Doc20: 2 Doc21: 3
Doc22: 2 Doc23: 4 Doc24: 3 Doc25: 3 Doc26: 1 Doc27: 5 Doc28: 3 Doc29: 3 Doc30: 3

Silhouette coefficient: 0.382

6 clustere
Doc1: 0 Doc2: 1 Doc3: 1 Doc4: 0 Doc5: 0 Doc6: 0 Doc7: 0 Doc8: 0 Doc9: 0 Doc10: 0 Doc11: 0
Doc12: 0 Doc13: 0 Doc14: 0 Doc15: 0 Doc16: 2 Doc17: 3 Doc18: 3 Doc19: 4 Doc20: 2 Doc21: 3
Doc22: 2 Doc23: 5 Doc24: 4 Doc25: 3 Doc26: 0 Doc27: 3 Doc28: 4 Doc29: 3 Doc30: 4

Silhouette coefficient: 0.424


6 clustere
Doc1: 3 Doc2: 0 Doc3: 4 Doc4: 2 Doc5: 1 Doc6: 1 Doc7: 2 Doc8: 3 Doc9: 2 Doc10: 3 Doc11: 3
Doc12: 1 Doc13: 1 Doc14: 3 Doc15: 2 Doc16: 5 Doc17: 5 Doc18: 5 Doc19: 5 Doc20: 5 Doc21: 5
Doc22: 5 Doc23: 4 Doc24: 5 Doc25: 5 Doc26: 2 Doc27: 5 Doc28: 5 Doc29: 3 Doc30: 5

Silhouette coefficient: 0.351

6 clustere
Doc1: 1 Doc2: 5 Doc3: 5 Doc4: 1 Doc5: 1 Doc6: 1 Doc7: 1 Doc8: 1 Doc9: 1 Doc10: 1 Doc11: 1
Doc12: 1 Doc13: 1 Doc14: 1 Doc15: 1 Doc16: 0 Doc17: 3 Doc18: 3 Doc19: 0 Doc20: 0 Doc21: 3
Doc22: 0 Doc23: 4 Doc24: 2 Doc25: 3 Doc26: 1 Doc27: 3 Doc28: 2 Doc29: 3 Doc30: 2

Silhouette coefficient: 0.401

6 clustere
Doc1: 5 Doc2: 0 Doc3: 0 Doc4: 1 Doc5: 1 Doc6: 1 Doc7: 1 Doc8: 1 Doc9: 1 Doc10: 1 Doc11: 1
Doc12: 1 Doc13: 1 Doc14: 1 Doc15: 1 Doc16: 3 Doc17: 5 Doc18: 5 Doc19: 2 Doc20: 3 Doc21: 2
Doc22: 3 Doc23: 4 Doc24: 2 Doc25: 5 Doc26: 1 Doc27: 5 Doc28: 2 Doc29: 5 Doc30: 2

Silhouette coefficient: 0.421

7 clustere
Doc1: 2 Doc2: 1 Doc3: 1 Doc4: 0 Doc5: 0 Doc6: 0 Doc7: 0 Doc8: 0 Doc9: 0 Doc10: 2 Doc11: 2
Doc12: 0 Doc13: 0 Doc14: 0 Doc15: 0 Doc16: 3 Doc17: 2 Doc18: 2 Doc19: 3 Doc20: 3 Doc21: 5
Doc22: 3 Doc23: 4 Doc24: 3 Doc25: 5 Doc26: 0 Doc27: 6 Doc28: 6 Doc29: 2 Doc30: 3

Silhouette coefficient: 0.382

7 clustere
Doc1: 2 Doc2: 1 Doc3: 3 Doc4: 2 Doc5: 4 Doc6: 4 Doc7: 2 Doc8: 2 Doc9: 2 Doc10: 2 Doc11: 2
Doc12: 4 Doc13: 4 Doc14: 2 Doc15: 2 Doc16: 5 Doc17: 0 Doc18: 0 Doc19: 5 Doc20: 5 Doc21: 0
Doc22: 5 Doc23: 6 Doc24: 0 Doc25: 0 Doc26: 2 Doc27: 0 Doc28: 0 Doc29: 0 Doc30: 5

Silhouette coefficient: 0.456

7 clustere
Doc1: 3 Doc2: 0 Doc3: 1 Doc4: 3 Doc5: 2 Doc6: 2 Doc7: 3 Doc8: 3 Doc9: 3 Doc10: 3 Doc11: 3
Doc12: 2 Doc13: 2 Doc14: 3 Doc15: 3 Doc16: 4 Doc17: 6 Doc18: 6 Doc19: 4 Doc20: 4 Doc21: 6
Doc22: 4 Doc23: 5 Doc24: 4 Doc25: 6 Doc26: 3 Doc27: 6 Doc28: 4 Doc29: 6 Doc30: 4

Silhouette coefficient: 0.464

7 clustere
Doc1: 6 Doc2: 0 Doc3: 0 Doc4: 3 Doc5: 2 Doc6: 2 Doc7: 6 Doc8: 3 Doc9: 6 Doc10: 6 Doc11: 6
Doc12: 2 Doc13: 2 Doc14: 3 Doc15: 3 Doc16: 1 Doc17: 4 Doc18: 4 Doc19: 1 Doc20: 1 Doc21: 4
Doc22: 1 Doc23: 5 Doc24: 2 Doc25: 4 Doc26: 6 Doc27: 4 Doc28: 2 Doc29: 4 Doc30: 1

Silhouette coefficient: 0.352

7 clustere
Doc1: 2 Doc2: 0 Doc3: 3 Doc4: 1 Doc5: 2 Doc6: 2 Doc7: 2 Doc8: 2 Doc9: 2 Doc10: 2 Doc11: 2
Doc12: 2 Doc13: 2 Doc14: 2 Doc15: 2 Doc16: 4 Doc17: 6 Doc18: 6 Doc19: 4 Doc20: 4 Doc21: 6
Doc22: 4 Doc23: 5 Doc24: 6 Doc25: 6 Doc26: 2 Doc27: 6 Doc28: 6 Doc29: 6 Doc30: 6

Silhouette coefficient: 0.469

8 clustere
Doc1: 1 Doc2: 6 Doc3: 6 Doc4: 0 Doc5: 1 Doc6: 1 Doc7: 1 Doc8: 1 Doc9: 1 Doc10: 1 Doc11: 1
Doc12: 1 Doc13: 1 Doc14: 1 Doc15: 1 Doc16: 2 Doc17: 3 Doc18: 3 Doc19: 5 Doc20: 4 Doc21: 5
Doc22: 4 Doc23: 7 Doc24: 3 Doc25: 3 Doc26: 1 Doc27: 3 Doc28: 3 Doc29: 3 Doc30: 5

Silhouette coefficient: 0.466

8 clustere
Doc1: 3 Doc2: 1 Doc3: 2 Doc4: 0 Doc5: 0 Doc6: 0 Doc7: 0 Doc8: 0 Doc9: 0 Doc10: 3 Doc11: 3
Doc12: 0 Doc13: 0 Doc14: 0 Doc15: 0 Doc16: 6 Doc17: 3 Doc18: 3 Doc19: 4 Doc20: 6 Doc21: 5
Doc22: 6 Doc23: 7 Doc24: 4 Doc25: 5 Doc26: 0 Doc27: 6 Doc28: 4 Doc29: 3 Doc30: 4

Silhouette coefficient: 0.485

8 clustere
Doc1: 1 Doc2: 2 Doc3: 2 Doc4: 0 Doc5: 1 Doc6: 1 Doc7: 1 Doc8: 1 Doc9: 1 Doc10: 1 Doc11: 1
Doc12: 1 Doc13: 1 Doc14: 1 Doc15: 1 Doc16: 4 Doc17: 3 Doc18: 3 Doc19: 5 Doc20: 4 Doc21: 3
Doc22: 5 Doc23: 6 Doc24: 7 Doc25: 3 Doc26: 1 Doc27: 3 Doc28: 7 Doc29: 3 Doc30: 7

Silhouette coefficient: 0.432

8 clustere
Doc1: 4 Doc2: 0 Doc3: 1 Doc4: 5 Doc5: 3 Doc6: 3 Doc7: 4 Doc8: 5 Doc9: 4 Doc10: 5 Doc11: 4
Doc12: 3 Doc13: 3 Doc14: 5 Doc15: 5 Doc16: 2 Doc17: 6 Doc18: 6 Doc19: 2 Doc20: 2 Doc21: 6
Doc22: 2 Doc23: 7 Doc24: 6 Doc25: 6 Doc26: 4 Doc27: 6 Doc28: 6 Doc29: 6 Doc30: 6

Silhouette coefficient: 0.458

8 clustere
Doc1: 7 Doc2: 1 Doc3: 4 Doc4: 0 Doc5: 3 Doc6: 3 Doc7: 3 Doc8: 3 Doc9: 3 Doc10: 3 Doc11: 7
Doc12: 3 Doc13: 3 Doc14: 3 Doc15: 3 Doc16: 2 Doc17: 5 Doc18: 5 Doc19: 2 Doc20: 2 Doc21: 5
Doc22: 2 Doc23: 6 Doc24: 5 Doc25: 5 Doc26: 3 Doc27: 5 Doc28: 5 Doc29: 7 Doc30: 5

Silhouette coefficient: 0.485

9 clustere
Doc1: 7 Doc2: 0 Doc3: 1 Doc4: 2 Doc5: 3 Doc6: 3 Doc7: 7 Doc8: 4 Doc9: 7 Doc10: 4 Doc11: 7
Doc12: 3 Doc13: 3 Doc14: 4 Doc15: 4 Doc16: 6 Doc17: 8 Doc18: 8 Doc19: 6 Doc20: 6 Doc21: 8
Doc22: 6 Doc23: 5 Doc24: 3 Doc25: 8 Doc26: 7 Doc27: 8 Doc28: 3 Doc29: 8 Doc30: 6

Silhouette coefficient: 0.464

9 clustere
Doc1: 1 Doc2: 2 Doc3: 2 Doc4: 5 Doc5: 0 Doc6: 1 Doc7: 3 Doc8: 3 Doc9: 3 Doc10: 3 Doc11: 1
Doc12: 0 Doc13: 0 Doc14: 3 Doc15: 3 Doc16: 6 Doc17: 7 Doc18: 8 Doc19: 6 Doc20: 6 Doc21: 8
Doc22: 6 Doc23: 4 Doc24: 6 Doc25: 8 Doc26: 3 Doc27: 7 Doc28: 7 Doc29: 8 Doc30: 6

Silhouette coefficient: 0.431

9 clustere
Doc1: 5 Doc2: 4 Doc3: 1 Doc4: 2 Doc5: 3 Doc6: 3 Doc7: 3 Doc8: 5 Doc9: 3 Doc10: 5 Doc11: 3
Doc12: 3 Doc13: 3 Doc14: 5 Doc15: 3 Doc16: 7 Doc17: 8 Doc18: 8 Doc19: 7 Doc20: 7 Doc21: 8
Doc22: 7 Doc23: 6 Doc24: 0 Doc25: 8 Doc26: 3 Doc27: 8 Doc28: 0 Doc29: 8 Doc30: 0

Silhouette coefficient: 0.503

9 clustere
Doc1: 5 Doc2: 0 Doc3: 0 Doc4: 5 Doc5: 4 Doc6: 4 Doc7: 5 Doc8: 5 Doc9: 5 Doc10: 5 Doc11: 2
Doc12: 3 Doc13: 4 Doc14: 5 Doc15: 5 Doc16: 2 Doc17: 1 Doc18: 1 Doc19: 6 Doc20: 2 Doc21: 6
Doc22: 2 Doc23: 7 Doc24: 6 Doc25: 1 Doc26: 5 Doc27: 2 Doc28: 8 Doc29: 1 Doc30: 6

Silhouette coefficient: 0.467


9 clustere
Doc1: 0 Doc2: 2 Doc3: 5 Doc4: 1 Doc5: 4 Doc6: 4 Doc7: 6 Doc8: 6 Doc9: 6 Doc10: 6 Doc11: 0
Doc12: 4 Doc13: 4 Doc14: 6 Doc15: 6 Doc16: 7 Doc17: 0 Doc18: 0 Doc19: 8 Doc20: 7 Doc21: 1
Doc22: 7 Doc23: 3 Doc24: 8 Doc25: 1 Doc26: 6 Doc27: 0 Doc28: 8 Doc29: 0 Doc30: 8

Silhouette coefficient: 0.535

10 clustere
Doc1: 0 Doc2: 7 Doc3: 1 Doc4: 3 Doc5: 3 Doc6: 3 Doc7: 3 Doc8: 3 Doc9: 3 Doc10: 3 Doc11: 2
Doc12: 3 Doc13: 3 Doc14: 3 Doc15: 3 Doc16: 2 Doc17: 5 Doc18: 5 Doc19: 6 Doc20: 8 Doc21: 6
Doc22: 8 Doc23: 4 Doc24: 6 Doc25: 5 Doc26: 3 Doc27: 8 Doc28: 9 Doc29: 5 Doc30: 6

Silhouette coefficient: 0.531

10 clustere
Doc1: 8 Doc2: 2 Doc3: 0 Doc4: 1 Doc5: 3 Doc6: 3 Doc7: 3 Doc8: 3 Doc9: 3 Doc10: 3 Doc11: 3
Doc12: 3 Doc13: 3 Doc14: 3 Doc15: 3 Doc16: 4 Doc17: 8 Doc18: 8 Doc19: 5 Doc20: 6 Doc21: 5
Doc22: 6 Doc23: 7 Doc24: 9 Doc25: 1 Doc26: 3 Doc27: 8 Doc28: 9 Doc29: 8 Doc30: 5

Silhouette coefficient: 0.543

10 clustere
Doc1: 2 Doc2: 0 Doc3: 4 Doc4: 1 Doc5: 3 Doc6: 3 Doc7: 1 Doc8: 1 Doc9: 1 Doc10: 1 Doc11: 2
Doc12: 3 Doc13: 3 Doc14: 1 Doc15: 1 Doc16: 5 Doc17: 2 Doc18: 7 Doc19: 9 Doc20: 5 Doc21: 9
Doc22: 5 Doc23: 6 Doc24: 8 Doc25: 7 Doc26: 1 Doc27: 2 Doc28: 8 Doc29: 2 Doc30: 9

Silhouette coefficient: 0.557

10 clustere
Doc1: 4 Doc2: 3 Doc3: 0 Doc4: 3 Doc5: 1 Doc6: 1 Doc7: 3 Doc8: 3 Doc9: 3 Doc10: 3 Doc11: 4
Doc12: 2 Doc13: 2 Doc14: 3 Doc15: 3 Doc16: 6 Doc17: 4 Doc18: 4 Doc19: 6 Doc20: 6 Doc21: 5
Doc22: 6 Doc23: 7 Doc24: 9 Doc25: 8 Doc26: 3 Doc27: 4 Doc28: 9 Doc29: 4 Doc30: 5

Silhouette coefficient: 0.500

10 clustere
Doc1: 2 Doc2: 1 Doc3: 1 Doc4: 0 Doc5: 3 Doc6: 3 Doc7: 0 Doc8: 0 Doc9: 0 Doc10: 2 Doc11: 2
Doc12: 3 Doc13: 3 Doc14: 0 Doc15: 0 Doc16: 4 Doc17: 5 Doc18: 5 Doc19: 8 Doc20: 6 Doc21: 8
Doc22: 6 Doc23: 9 Doc24: 7 Doc25: 5 Doc26: 0 Doc27: 5 Doc28: 7 Doc29: 5 Doc30: 8

Silhouette coefficient: 0.510

Part 3
EXPLORING THE STRUCTURE



Chapter 8. Search Engines
• Indexing and searching within a site

Chapter 9. Web Page Relevance
• The PageRank algorithm

Chapter 8


SEARCH ENGINES


In this chapter we describe a proof-of-concept implementation of a
search engine. To demonstrate how keywords are indexed and searched, we
present an offline simulation that ignores the problems of crawling and
downloading real pages from the Internet.

Application 8.1. The class that describes an indexed page. Exactly the
same class must be used in two different projects: the one that crawls and
indexes the simulated site, and the one that finds the keywords and displays
the pages discovered. The class is therefore compiled into a DLL that is
referenced from both projects.


using System;
using System.Collections.Generic;
using System.Text;


namespace IndexedPage
{
    // A page of the simulated site, parsed from a PAG file whose text has
    // three sections, in this order: TITLE, CONTENT and LINKS.
    [Serializable]
    public class Page
    {
        public string Title;
        public string[] ContentWords;
        public string[] Links;

        public Page(string fileContent)
        {
            int titleIndex = fileContent.IndexOf("TITLE");
            int contentIndex = fileContent.IndexOf("CONTENT");
            int linksIndex = fileContent.IndexOf("LINKS");

            // the title is the text between the TITLE and CONTENT markers;
            // 5 is the length of "TITLE" and 1 skips the character after it
            Title = fileContent.Substring(titleIndex + 5 + 1, contentIndex - titleIndex - 5 - 1);
            Title = Title.Trim();

            // the content is the text between the CONTENT and LINKS markers,
            // split into words on whitespace and punctuation
            string cnt = fileContent.Substring(contentIndex + 7 + 1,
                linksIndex - contentIndex - 7 - 1);
            ContentWords = cnt.Split(" \t\r\n{}*-.,;:()[]<>/+~!@#$%^&\"\'".ToCharArray(),
                StringSplitOptions.RemoveEmptyEntries);

            // the links are everything after the LINKS marker, one per token
            string lnk = fileContent.Substring(linksIndex + 5 + 1);
            Links = lnk.Split(" \t\r\n".ToCharArray(),
                StringSplitOptions.RemoveEmptyEntries);
        }
    }
}


Application 8.2. Implementation of the spider that indexes the words
of the pages it encounters. The simulated site is stored as a file structure
on the local hard disk. In the test case considered, the structure is the one
in Figure 8.1, which contains some information from the author's real site:
http://www.ace.tuiasi.ro/~fleon.


Figure 8.1. File structure of the site considered


Figure 8.2. The sections of a page

To simplify parsing, each PAG file has three distinct sections, TITLE,
CONTENT and LINKS, in which the information is given explicitly, as can be
seen in Figure 8.2. For a real spider, identifying these sections comes down
to analyzing the HTML tags of the pages visited.
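
As a rough illustration of what that HTML analysis could look like, the
sketch below pulls a title, the link targets and the visible text out of an
HTML string with regular expressions. The class HtmlSections and its methods
are hypothetical names introduced here, not part of the book's code, and
regular expressions are only an approximation: a production spider would
normally use a proper HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only (not part of the book's code).
public static class HtmlSections
{
    // Returns the text of the first <title> element, or "" if there is none.
    public static string ExtractTitle(string html)
    {
        Match m = Regex.Match(html, @"<title[^>]*>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : "";
    }

    // Returns the href values of the <a> tags, in document order.
    public static List<string> ExtractLinks(string html)
    {
        List<string> links = new List<string>();
        foreach (Match m in Regex.Matches(html,
            @"<a\s[^>]*?href\s*=\s*[""']([^""']*)[""']",
            RegexOptions.IgnoreCase))
        {
            links.Add(m.Groups[1].Value);
        }
        return links;
    }

    // Removes script and style blocks, then every remaining tag, leaving
    // text that can be split into words as the Page class does.
    public static string ExtractText(string html)
    {
        string noScripts = Regex.Replace(html,
            @"<(script|style)[^>]*>.*?</\1>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return Regex.Replace(noScripts, @"<[^>]+>", " ");
    }
}
```

The three results correspond to the TITLE, LINKS and CONTENT sections of a
PAG file, so they could feed the same indexing code.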


8.2.1. Program.cs. The main program, which runs the spider and saves
the indexed information.


using System;
using System.Collections.Generic;
using System.Text;


namespace Spider
{
    class Program
    {
        static void Main(string[] args)
        {
            Spider s = new Spider();
            s.BeginSearch(@"d:\wm\Site\", "index.pag");
            s.SaveIndex();
            Console.WriteLine("Saved.");
        }
    }
}


8.2.2. Spider.cs. Implementation of the spider.


using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;


namespace Spider
{
    public class Spider
    {
        List<string> processed;                         // pages already indexed
        Queue<string> fringe;                           // pages waiting to be visited
        Dictionary<string, List<string>> wordsToPages;  // word -> pages that contain it
        Dictionary<string, IndexedPage.Page> cache;     // page path -> parsed page
        string path;


        public Spider()
        {
            processed = new List<string>();
            fringe = new Queue<string>();
            wordsToPages = new Dictionary<string, List<string>>();
            cache = new Dictionary<string, IndexedPage.Page>();
        }


        public void BeginSearch(string indexPath, string indexName)
        {
            path = indexPath;

            fringe.Enqueue(indexName);

            while (fringe.Count > 0)
            {
                string crtPage = fringe.Dequeue();

                // a page linked from several places can be queued more than once;
                // indexing it again would add a duplicate key to the cache
                if (processed.Contains(crtPage))
                    continue;
                processed.Add(crtPage);

                Console.WriteLine("Processing: " + path + crtPage);

                string all;
                using (StreamReader sr = new StreamReader(path + crtPage))
                    all = sr.ReadToEnd();
                IndexedPage.Page p = new IndexedPage.Page(all);

                foreach (string s in p.Links)
                {
                    if (!processed.Contains(s))
                        fringe.Enqueue(s);
                }

                // the actual processing

                // the words
                foreach (string wordOrig in p.ContentWords)
                {
                    string word = wordOrig.ToLower();

                    if (wordsToPages.ContainsKey(word))
                    {
                        // the list is a reference type, so it is updated in place
                        List<string> pageList = wordsToPages[word];
                        if (!pageList.Contains(path + crtPage))
                            pageList.Add(path + crtPage);
                    }
                    else // not in the dictionary yet
                    {
                        List<string> pageList = new List<string>();
                        pageList.Add(path + crtPage);
                        wordsToPages.Add(word, pageList);
                    }
                }

                // the pages
                cache.Add(path + crtPage, p);
            }
        }


        public void SaveIndex()
        {
            // FileMode.Create truncates an existing file, so no bytes of an
            // older, longer index are left at the end
            FileStream fs = new FileStream("index.dat", FileMode.Create, FileAccess.Write);
            BinaryFormatter bf = new BinaryFormatter();
            bf.Serialize(fs, wordsToPages);
            fs.Close();

            fs = new FileStream("cache.dat", FileMode.Create, FileAccess.Write);
            bf.Serialize(fs, cache);
            fs.Close();
        }
    }
}
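
The two ideas in the spider, a breadth-first traversal of the link graph and
an inverted index from words to pages, can also be sketched without any file
access. The version below is a hypothetical in-memory variant: the class
CrawlSketch, the method BuildIndex and the two input dictionaries are
assumptions made for the example. It uses sets instead of lists so that
duplicate pages are rejected by the data structure itself.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory variant of the spider, for illustration only.
public static class CrawlSketch
{
    // pageWords maps a page name to its words, pageLinks maps a page name
    // to the names it links to (assumed input format).
    public static Dictionary<string, SortedSet<string>> BuildIndex(
        Dictionary<string, string[]> pageWords,
        Dictionary<string, string[]> pageLinks,
        string startPage)
    {
        var index = new Dictionary<string, SortedSet<string>>();
        var visited = new HashSet<string>();   // pages already queued or indexed
        var fringe = new Queue<string>();

        visited.Add(startPage);
        fringe.Enqueue(startPage);

        while (fringe.Count > 0)
        {
            string page = fringe.Dequeue();

            // index the words of the current page, ignoring case
            foreach (string w in pageWords[page])
            {
                string word = w.ToLowerInvariant();
                SortedSet<string> pages;
                if (!index.TryGetValue(word, out pages))
                {
                    pages = new SortedSet<string>();
                    index[word] = pages;
                }
                pages.Add(page);   // a set ignores repeated insertions
            }

            // HashSet.Add returns false for a page seen before, so every
            // page enters the queue at most once; unknown links are skipped
            foreach (string link in pageLinks[page])
                if (pageWords.ContainsKey(link) && visited.Add(link))
                    fringe.Enqueue(link);
        }
        return index;
    }
}
```

Marking a page as visited when it is queued, rather than when it is
processed, is the design choice that keeps the queue free of duplicates.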



Figure 8.3. Progress of the indexing process
Application 8.3. The search engine, which loads the information about
the pages and the words they contain, receives the users' queries and
displays the search results.


8.3.1. Default.aspx.cs. The main query page, where the user enters the
string of words to search for.


Figure 8.4. Design of the query page interface


using System;


public partial class _Default : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        Session["Query"] = null;
    }

    protected void Button_Click(object sender, EventArgs e)
    {
        Session["Query"] = TextBoxQuery.Text;
        Server.Transfer("Results.aspx");
    }
}


8.3.2. Results.aspx.cs. Displays the pages that contain the words
searched for by the user. Only the pages that contain all the words are
considered (the words are combined with the logical AND operator).


using System;
using System.Runtime.Serialization.Formatters.Binary;
using System.IO;
using System.Collections.Generic;


public partial class Results : System.Web.UI.Page
{
    Dictionary<string, List<string>> wordsToPages;
    Dictionary<string, IndexedPage.Page> cache;
    string path;


    protected void Page_Load(object sender, EventArgs e)
    {
        if (Session["Query"] == null)
            Response.Redirect("Default.aspx");

        path = Server.MapPath(".") + "\\";

        // the index is deserialized once per session and then reused
        if (Session["IndexLoaded"] == null)
        {
            LoadIndex();
            Session["IndexLoaded"] = "true";
        }
        else
        {
            wordsToPages = Session["WordsToPages"] as Dictionary<string, List<string>>;
            cache = Session["Cache"] as Dictionary<string, IndexedPage.Page>;
        }

        string query = (string)Session["Query"];
        string[] words = query.Split(" \t,;".ToCharArray(),
            StringSplitOptions.RemoveEmptyEntries);

        // an empty query has no words[0] to look up
        if (words.Length == 0)
            Response.Redirect("Default.aspx");

        for (int i = 0; i < words.Length; i++)
            words[i] = words[i].ToLower();

        bool notFound = false;
        List<string> pageList = new List<string>();

        if (!wordsToPages.ContainsKey(words[0]))
            notFound = true;

        if (!notFound)
        {
            // start from the pages of the first word and intersect with
            // the pages of every following word (logical AND)
            pageList = wordsToPages[words[0]];

            for (int i = 1; i < words.Length && !notFound; i++)
            {
                if (!wordsToPages.ContainsKey(words[i]))
                {
                    notFound = true;
                    break;
                }

                List<string> pageListNew = new List<string>();
                List<string> pageList1 = wordsToPages[words[i]];

                for (int j = 0; j < pageList1.Count; j++)
                {
                    if (pageList.Contains(pageList1[j]))
                        pageListNew.Add(pageList1[j]);
                }

                pageList = pageListNew;

                if (pageList.Count == 0)
                {
                    notFound = true;
                    break;
                }
            }
        }

        if (notFound)
            Response.Redirect("NotFound.aspx");

        StreamReader sr = new StreamReader(path + "template.htm");
        string template = sr.ReadToEnd();
        sr.Close();

        sr = new StreamReader(path + "entry.htm");
        string entry = sr.ReadToEnd();
        sr.Close();

        string lines = "";

        // fill in one copy of the entry template for each page found
        for (int i = 0; i < pageList.Count; i++)
        {
            string crtEntry = entry;
            IndexedPage.Page p = cache[pageList[i]];
            crtEntry = crtEntry.Replace("TITLE", p.Title);
            crtEntry = crtEntry.Replace("LINK", pageList[i]);

            // the snippet is made of the first 50 words of the page, at most
            string textSim = "";
            for (int j = 0; j < Math.Min(50, p.ContentWords.Length); j++)
                textSim += p.ContentWords[j] + " ";

            crtEntry = crtEntry.Replace("TEXT", textSim);

            lines += crtEntry;
        }

        template = template.Replace("ENTRIES", lines);

        StreamWriter sw = new StreamWriter(path + "results.htm");
        sw.Write(template);
        sw.Close();

        Response.Redirect("results.htm");
    }


    private void LoadIndex()
    {
        FileStream fs = new FileStream(path + "index.dat", FileMode.Open, FileAccess.Read);
        BinaryFormatter bf = new BinaryFormatter();
        wordsToPages = (Dictionary<string, List<string>>)bf.Deserialize(fs);
        fs.Close();

        fs = new FileStream(path + "cache.dat", FileMode.Open, FileAccess.Read);
        cache = (Dictionary<string, IndexedPage.Page>)bf.Deserialize(fs);
        fs.Close();

        Session["WordsToPages"] = wordsToPages;
        Session["Cache"] = cache;
    }
}
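
The AND semantics used by Results.aspx.cs can be isolated from the ASP.NET
page and expressed as a set intersection. The sketch below is one possible
formulation, assuming the same word-to-pages dictionary as the index; the
class AndQuery and the method Search are hypothetical names, and the result
is sorted only to make the output order predictable.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only (not part of the book's code).
public static class AndQuery
{
    // Returns the pages that contain every word of the query, or an empty
    // list if the query is empty or any word is missing from the index.
    public static List<string> Search(
        Dictionary<string, List<string>> wordsToPages, string query)
    {
        string[] words = query.ToLower().Split(" \t,;".ToCharArray(),
            StringSplitOptions.RemoveEmptyEntries);

        HashSet<string> result = null;
        foreach (string word in words)
        {
            List<string> pages;
            if (!wordsToPages.TryGetValue(word, out pages))
                return new List<string>();           // one absent word empties the AND

            if (result == null)
                result = new HashSet<string>(pages); // first word: start from its pages
            else
                result.IntersectWith(pages);         // later words: keep common pages

            if (result.Count == 0)
                break;                               // nothing left to intersect
        }

        List<string> sorted = result == null
            ? new List<string>() : new List<string>(result);
        sorted.Sort(StringComparer.Ordinal);
        return sorted;
    }
}
```

With a hash set, each membership test takes constant time on average,
whereas List.Contains scans the whole list; the difference only matters once
the lists of pages become long.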


8.3.3. template.htm. The template used to display the results.

<html>

<head>
<title>WMSearch Results</title>
</head>

<body>

<p><b><font size="6">WMSearch</font></b></p>

ENTRIES

<p><a href="Default.aspx">Search again</a></p>

</body>

</html>

8.3.4. entry.htm. The template used to display one entry in the list
of results, that is, a page in which the requested words were found.

<h2 class="r">&nbsp;</h2>
<h2 class="r" style="margin-top: 0; margin-bottom: 0"><a
href="LINK">TITLE</a></h2>
<p class="r" style="margin-top: 0; margin-bottom: 0">TEXT</p>
<p class="r" style="margin-top: 0; margin-bottom: 0"><font
color="#008000">LINK</font></p>


8.3.5. NotFound.aspx. The error page shown when no indexed page
contains the words searched for.


Figure 8.4. Design of the error page interface


Figures 8.5 and 8.6 show an example run of the implemented search
engine.



Figure 8.5. Entering the keywords

Figure 8.6. Display of the search engine results



Chapter 9


WEB PAGE RELEVANCE


In this chapter we present the iterative implementation of the
PageRank algorithm, used by Google to determine the relevance of web pages.
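
The update rule that the listing in PageRanker.cs applies to every page can
be written compactly as below. This is a restatement of that code, with
notation introduced here for convenience: In(p) is the set of pages that link
to p, C(q) is the number of outbound links of page q, and d is the damping
factor.

```latex
PR(p) = (1 - d) + d \sum_{q \in \mathrm{In}(p)} \frac{PR(q)}{C(q)}
```

Every page starts with PR = 1, and the rule is applied repeatedly to all
pages for a fixed number of iterations.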


Application 9.1. The PageRank algorithm


9.1.1. Program.cs. The main program, which instantiates web sites
with different topologies and damping factors (d) for the algorithm and
calls the ranking function.


using System;
using System.Collections.Generic;
using System.Text;


namespace PageRank
{
    class Program
    {
        static void Main(string[] args)
        {
            // WebPage(name, names of the pages that link to it, number of outbound links)
            WebPage[] site1 = {
                new WebPage('A',"B",1),     // B -> A
                new WebPage('B',"A",1),     // A -> B
                new WebPage('C',"AB",0)};   // A,B -> C

            PageRanker.RankSite(site1, 0.5, 100);

            for (int i = 0; i < site1.Length; i++)
                Console.WriteLine(site1[i].Name + " -> " + site1[i].PageRank);
            Console.WriteLine();

            WebPage[] site2 = {
                new WebPage('A',"",2),      // no links to A
                new WebPage('B',"ACDE",1),  // A,C,D,E -> B
                new WebPage('C',"A",1),     // A -> C
                new WebPage('D',"BE",1),    // B,E -> D
                new WebPage('E',"",2)};     // no links to E
            PageRanker.RankSite(site2, 0.85, 100); // d is usually 0.85

            for (int i = 0; i < site2.Length; i++)
                Console.WriteLine(site2[i].Name + " -> " + site2[i].PageRank);

            Console.WriteLine();

            WebPage[] site3 = {
                new WebPage('A',"B",2),
                new WebPage('B',"A",1),
                new WebPage('C',"AD",1),
                new WebPage('D',"C",1)};

            PageRanker.RankSite(site3, 0.75, 100); // d is usually 0.85

            for (int i = 0; i < site3.Length; i++)
                Console.WriteLine(site3[i].Name + " -> " + site3[i].PageRank);
        }
    }
}

9.1.2. WebPage.cs. The class that describes a page of the web site.


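using System;
using System.Collections.Generic;
using System.Text;


namespace PageRank
{
    class WebPage
    {
        public char Name;           // one-letter name of the page
        public int Inbound;         // number of pages that link to this page
        public string InboundName;  // names of those pages, one character each
        public int Outbound;        // number of links leaving this page
        public double PageRank;

        public WebPage(char nume, string inbNume, int outb)
        {
            Name = nume;
            Inbound = inbNume.Length;
            InboundName = inbNume;
            Outbound = outb;
            PageRank = 1;           // initial value for the iteration
        }
    }
}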
9.1.3. PageRanker.cs. The actual implementation of the iterative
PageRank algorithm.


using System;
using System.Collections.Generic;
using System.Text;


namespace PageRank
{
    class PageRanker
    {
        public static void RankSite(WebPage[] site, double d, int noIter)
        {
            for (int i = 0; i < noIter; i++)
            {
                for (int j = 0; j < site.Length; j++)
                {
                    // sum PageRank / Outbound over the pages that link to page j
                    double prt = 0;
                    for (int k = 0; k < site[j].Inbound; k++)
                    {
                        // find the linking page by its name
                        char ibn = site[j].InboundName[k];
                        int index = 0;
                        while (site[index].Name != ibn)
                            index++;
                        prt += site[index].PageRank / (double)site[index].Outbound;
                    }
                    // the new value is used at once by the pages that follow
                    site[j].PageRank = (1 - d) + d * prt;
                }
            }
        }
    }
}
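
RankSite always performs the requested number of sweeps. A common variation,
sketched below, stops as soon as no PageRank value changes by more than a
tolerance. The class PageRankSketch, the array-based representation and the
parameters epsilon and maxIter are assumptions made for this example; they do
not come from the listing above.

```csharp
using System;

// Hypothetical variant with a stopping criterion, for illustration only.
public static class PageRankSketch
{
    // inbound[j] holds the indices of the pages that link to page j and
    // outbound[q] the number of links leaving page q. The ranks array is
    // updated in place; the return value is the number of sweeps performed.
    public static int Rank(double[] ranks, int[][] inbound, int[] outbound,
                           double d, double epsilon, int maxIter)
    {
        for (int iter = 1; iter <= maxIter; iter++)
        {
            double maxChange = 0;
            for (int j = 0; j < ranks.Length; j++)
            {
                double sum = 0;
                foreach (int q in inbound[j])
                    sum += ranks[q] / outbound[q];

                double updated = (1 - d) + d * sum;
                maxChange = Math.Max(maxChange, Math.Abs(updated - ranks[j]));
                ranks[j] = updated;   // in-place update, as in RankSite
            }
            if (maxChange < epsilon)
                return iter;          // no value moved by epsilon or more
        }
        return maxIter;
    }
}
```

For the values given for site1 in Program.cs (A and B link to each other with
one outbound link each, both are listed as inbound to C, d = 0.5, all ranks
starting at 1), the ranks settle at A = 1, B = 1 and C = 1.5, and the sketch
stops after the second sweep because nothing changes any more.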


The results of running the program, namely the PageRank coefficients
of each page of the sites considered, are shown in Figure 9.1.

Figure 9.1. Results of running the program
