Abstract- Personalized Web page recommendation is increasingly in demand.
Web page recommendation systems are generally implemented in Web servers and
recommend pages using data obtained implicitly from users' Web browsing
patterns. The existing system collects Web logs, generates clusters of similar
users and recommends pages to the user by analysing the logs online. However,
the time complexity of this online analysis is high. To reduce it and to
improve the correctness of recommendations, we propose applying a Firefly based
algorithm together with Naive Bayes clustering to recommend Web pages. Web logs
are clustered offline using the Naive Bayes clustering technique, a probability
based method that eliminates odd records while forming clusters. A Firefly
algorithm based similarity measure is then used to find the similarity between
the active user's query and the other users in the cluster. The Firefly
algorithm searches the Web logs in the active user's cluster and, because it
uses time efficiently, it is applied in the online stage. The pages obtained
are ranked and the top pages most relevant to the query are recommended. The
efficiency of the system can be evaluated using measures such as precision,
recall, Matthews correlation coefficient and fallout rate. The proposed
approach is expected to reduce online processing time and to recommend more
accurate Web pages.

 

Introduction- Web page recommendation is a sub-domain of recommendation
systems that recommends a set of Web pages to users based on their past
browsing patterns. It applies mining techniques to data previously gathered
from users in order to discover and extract information from Web documents and
services. The major concern is to find reliable and efficient recommendation
algorithms. A recommendation system typically produces its results in one of
two ways: collaborative filtering or content based filtering.


 

A. Collaborative Filtering

 

Most recommendation systems make wide use of collaborative filtering. This
method relies on collecting and processing information on users' behaviours or
activities and then predicting items of interest based on their similarity to
other users. Collaborative filtering approaches build a model from a user's
past behaviour and from the decisions of other similar users. This model is
used to predict the items a user is interested in. Since collaborative
filtering does not depend on machine analysable content, it can recommend
complex items accurately without “understanding” the items themselves.

 

B. Content Based Filtering

Content based filtering is a widely used approach for designing recommendation
systems. The technique is based on a description of each item and a profile of
the user's preferences. In a content based recommendation system, keywords are
taken to represent the user's interests. The system uses a set of distinct
properties of an item to find and recommend items with the same properties.
These algorithms try to recommend items by examining the items that the user
liked in the past or likes at present. In general, the items in a candidate set
are compared with items previously rated by the user and the best matching
items are recommended. Collaborative and content based approaches are often
combined into hybrid recommendation systems.

 

 

II. Literature survey

Recommendation systems play a vital role in recommending personalized items to
users of web services based on their interests. The web contains rich and
dynamic information, and the amount of information is growing rapidly, as are
the number of web sites and the number of web pages per site. Predicting the
needs of a web user as she visits web sites has therefore gained importance.
Many web page recommendation systems have been developed in the past; since
they compute recommendations online, they must use time efficiently. A system
[4] based on a support vector machine (SVM) learning model was developed for
computing the similarity between two items, and it performed better than the
latent factor approach for group recommendations. Because a matrix
representation was used, the data sparsity problem was solved. However, the
system did not scale stably when the size of the group increased dynamically.

 

Hybrid recommender systems that combine two or more recommendation techniques
have been designed [5]. They eliminate the weaknesses that exist when only one
recommender system is used. The systems can be combined in several ways, such
as a weighted hybrid recommender, in which the score of a recommended item is
computed from the results of all the recommendation techniques available in the
system. However, data sparseness remained a problem: the system may generate
weak recommendations if few users have rated the same items, and it does not
overcome the cold start problem.

Hyperspectral sensors can acquire hundreds of contiguous bands over a wide
electromagnetic spectrum for each pixel. The rich spectral information allows
materials with subtle spectral discrepancies to be distinguished, but it
usually leads to the “curse of dimensionality”. To address this, an improved
Firefly algorithm based band selection method [8] was used.

 

The Firefly algorithm is an evolutionary optimization algorithm proposed by
Yang [13]. After the parameters are initialized, the brightness is calculated
with the objective function (2.1), where t is the maximum number of iterations,
α is the step size and γ is the light absorbance of the m fireflies. The
movement states are then evaluated and the bands are selected. To avoid
employing an actual classifier within the band search, and so greatly reduce
computational cost, criterion functions that gauge class separability are
preferred; these provided better results. The Firefly algorithm also converged
faster even when the data size was large.

To improve the accuracy of the similarity measure, a nature inspired algorithm
based on the behaviour of fireflies was introduced [10]. It considers separate
effects for the ratings of users with similar opinions and of users with
conflicting opinions. The initial population of fireflies is generated
randomly. Mean absolute error was chosen as the objective function to measure
recommendation accuracy; it is obtained from the difference between the
predicted rating and the real rating.
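
The mean absolute error objective mentioned above can be sketched as follows.
This is a minimal illustration, not the implementation used in [10]; the
function name and the parameter names `predicted` and `actual` are assumptions
introduced here.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and real ratings."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)
```

A lower value means that the predicted ratings are closer to the real ratings,
so a firefly with a lower error would be treated as the brighter one.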

 

An optimal similarity measure, obtained as a simple linear combination of
values and ratios of ratings for user-based collaborative filtering, provides
better results. It increased the speed of finding the nearest neighbours of the
active user and reduced computation time. Because the similarity function based
on the Firefly algorithm was simpler than the equations used in traditional
metrics, the method provided recommendations faster than traditional metrics.
Graph colouring problems are generally discrete, and algorithms for discrete
problems are quite complex.

 

A new algorithm based on similarity, which discretizes the Firefly algorithm
directly without any other hybrid algorithm, was developed [11]. It was
adaptable to dynamic graph sizes. A system for assigning an electronic document
to one or more predefined categories or classes based on its textual content,
using an agglomerative clustering algorithm, was developed [6]. This type of
clustering, with the sample correlation coefficient as the similarity measure,
allowed a high reduction factor in the indexing term space together with a gain
in classification accuracy.

 

To minimize noise and outlier data, a modified DBSCALE algorithm using Naïve
Bayes has been designed [7]. The algorithm is essentially a probability based
utility function: because Naïve Bayes is probability based, it estimates and
removes outlier cluster data and increases the correctness rate of the
algorithm for a given threshold value. It also computes the maximum a
posteriori hypothesis for outlier data.

The memory based collaborative system uses matrix based computation and solves
the data sparsity problem, but the system does not scale stably when the size
of the group increases dynamically. A hybrid system could help overcome the
scalability issue, but it in turn leads to the cold start problem.

 

To eliminate outliers as well as to overcome the other two problems, Naive
Bayes clustering, a probability based method, has been used in the past. The
Firefly algorithm converges faster and searches all possible subsets with
better time utilization. Thus, to design an efficient recommendation system,
the Naïve Bayes method can be used for offline clustering. Since the online
time complexity should be low, the Firefly algorithm, which uses time more
efficiently, can be used for calculating similarity online. Combining these two
techniques may increase the accuracy of the recommendation system and make its
use of time more efficient.

 

 

 

 

III. Overview of the proposed work

 

Initially, the web log files are obtained from America Online Inc. [1]. The log
files consist of five fields: an anonymous ID for each user, the user's query,
the query time, the URL the user proceeded to and its rank in the result. The
logs are collected and grouped by anonymous ID. The URLs visited by all users
are gathered and their content is downloaded and processed. Processing includes
removing stop words from the content of each URL and extracting keywords.
Similar users are clustered on the extracted keywords using the Naïve Bayes
clustering technique, which produces more efficient clusters than clustering
with association rules. The resulting clusters are passed to the online
component. In the online process, when an active user submits a query, the
keywords are extracted from the query. The similarity between the extracted
keywords and the other users in the active user's cluster is calculated using
the Firefly similarity measure. The similarity values are sorted along with the
web pages browsed by similar users in the cluster, and the top k web pages are
recommended to the active user.

 

 

 

IV. The proposed work

 

The proposed system follows a linear process: the web logs are collected and
processed, similar users are clustered using the Naïve Bayes clustering
technique, and recommendations are generated based on a similarity measure
derived from the Firefly algorithm.

 

A. Pre-processing of Web Logs

 

The web logs are collected from AOL Inc. [1]. They consist of 20 million web
queries from 650 thousand real users over 3 months. The data set includes the
anonymous ID, query, query time, item rank and click URL. The log file contains
many users along with the web pages visited by them. It is validated and split
by anonymous ID so that each user's records are placed in an individual file.
The content of each URL is fetched and downloaded. This content undergoes stop
word removal and stemming, and the final keywords are then extracted. The
features keywords, timing, frequency, click URL and revisit are derived, and
the user profile is constructed from these features, all of which are taken
from the user's log file.

 

Timing: The time the user spent on a particular URL

Frequency: The number of times the user visited the URL

Clickstream: The number of clickstream URLs visited by the user

Revisit: Whether the user revisited the web page
 

Keywords are extracted from the data fetched from the URL. The timing for each
URL is estimated from the given date and time by calculating the difference
between successive URLs searched in a single day, subject to some time
constraints. Frequency is calculated as the number of times the user clicked
the URL. Clickstreams are the URLs clicked by the user for additional
information. The timing of revisits is calculated to decide whether or not the
user preferred the page. The information from the URL is thus collected and
processed to obtain the features of the user.
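
The grouping and feature steps described above might be sketched as follows.
This is a simplified illustration under stated assumptions: records are already
parsed into the five AOL fields, query times use the `YYYY-MM-DD HH:MM:SS`
format, and the 30-minute cap `MAX_GAP_SECONDS` stands in for the unspecified
"time constraints". The function name and the profile layout are hypothetical,
and keyword extraction, stop word removal and stemming are not covered.

```python
from collections import defaultdict
from datetime import datetime

MAX_GAP_SECONDS = 1800  # hypothetical cap on the time credited to one page


def build_profiles(records):
    """records: iterable of (anon_id, query, query_time, item_rank, click_url)."""
    by_user = defaultdict(list)
    for anon_id, _query, query_time, _rank, url in records:
        if url:  # keep only rows in which a URL was clicked
            when = datetime.strptime(query_time, "%Y-%m-%d %H:%M:%S")
            by_user[anon_id].append((when, url))

    profiles = {}
    for anon_id, clicks in by_user.items():
        clicks.sort()  # chronological order per user
        stats = defaultdict(lambda: {"frequency": 0, "timing": 0.0})
        for i, (when, url) in enumerate(clicks):
            stats[url]["frequency"] += 1
            # Timing: gap to the user's next click on the same day, capped
            if i + 1 < len(clicks) and clicks[i + 1][0].date() == when.date():
                gap = (clicks[i + 1][0] - when).total_seconds()
                stats[url]["timing"] += min(gap, MAX_GAP_SECONDS)
        for page_stats in stats.values():
            page_stats["revisit"] = page_stats["frequency"] > 1
        profiles[anon_id] = dict(stats)
    return profiles
```

Each user's profile maps a URL to its frequency, accumulated timing in seconds
and a revisit flag; keywords would be attached to the same structure once the
page content has been processed.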

 

 

B. Naïve Bayes Clustering

 

Clustering, also known as unsupervised classification, is a descriptive task
with many applications. It is the decomposition or partition of a data set into
groups such that the objects in one group are similar to each other but as
different as possible from the objects in other groups. The three main
approaches to clustering data are partition based clustering, hierarchical
clustering and probabilistic model based clustering. Probabilistic model based
clustering is a soft clustering in which an object can belong to several
clusters according to a probability distribution. A clustering is useful if it
produces some interesting insight into the problem being analysed.

Naïve Bayes clustering is a probabilistic clustering technique based on Bayes'
theorem with a strong independence assumption between features. The feature
variables can be discrete or continuous. This probabilistic clustering handles
both nominal and numeric variables in the data set, and its novelty lies in the
use of mixtures of truncated exponentials (MTE) densities to model the numeric
variables. In Naïve Bayes clustering the class is the only root variable and
all the attributes are conditionally independent given the class. The
clustering problem reduces to taking a data set of instances and a previously
specified number of clusters (k), and working out each cluster's distribution
and the population distribution between the clusters. The expectation
maximization (EM) algorithm is used to obtain these parameters. Since Naïve
Bayes clustering is a probability based technique, an item belongs to a cluster
if and only if it is related to it, which helps eliminate outlier data during
clustering. It also provides proper clustering with fewer computations.

The given data set is divided into two parts, one for training and the other
for testing. For each record in the test and train databases, the distribution
of the class variable is computed. According to the obtained distribution, a
value for the class variable is simulated and the record is inserted in the
corresponding cluster. The log-likelihood of the new model is computed. If it
is higher than that of the initial model, the process is repeated. Otherwise,
the process is stopped and the obtained clusters are returned.
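
The EM loop described above can be illustrated with a deliberately simplified
sketch. It assumes binary keyword indicators as features rather than the MTE
densities mentioned in the text, and it keeps soft cluster memberships instead
of simulating a class value for each record. The function name `nb_cluster`,
its parameters and the smoothing constants are all assumptions made for this
example.

```python
import math
import random


def nb_cluster(rows, k, iters=50, seed=0):
    """Fit a naive Bayes mixture to 0/1 feature rows with EM.

    Returns (priors, theta, resp, loglik), where resp[i][c] is the
    probability that record i belongs to cluster c.
    """
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    priors = [1.0 / k] * k
    # theta[c][j] = P(feature j = 1 | cluster c), randomly initialised
    theta = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(k)]
    prev = -math.inf
    for _ in range(iters):
        # E-step: distribution of the class variable for every record
        resp, loglik = [], 0.0
        for x in rows:
            logp = []
            for c in range(k):
                lp = math.log(priors[c])
                for j in range(d):
                    p = theta[c][j]
                    lp += math.log(p if x[j] else 1.0 - p)
                logp.append(lp)
            m = max(logp)
            z = sum(math.exp(v - m) for v in logp)
            loglik += m + math.log(z)
            resp.append([math.exp(v - m) / z for v in logp])
        # Stop once the log-likelihood no longer improves
        if loglik - prev < 1e-6:
            break
        prev = loglik
        # M-step: re-estimate cluster priors and feature probabilities
        for c in range(k):
            w = sum(r[c] for r in resp)
            priors[c] = (w + 1.0) / (n + k)  # Laplace smoothing
            for j in range(d):
                on = sum(r[c] for r, x in zip(resp, rows) if x[j])
                theta[c][j] = (on + 1.0) / (w + 2.0)
    return priors, theta, resp, loglik
```

A record whose largest membership probability stays low under every cluster
could be flagged as an outlier, which is one way to read the outlier
elimination claimed above; the threshold for doing so is not given in the text.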

 

C. Optimisation Using Firefly Algorithm

 

The Firefly algorithm is an evolutionary algorithm based on the behaviour of
fireflies. Fireflies live in colonies and cooperate for the survival of the
colony. To model their behaviour, three assumptions are generally made: all
fireflies are homogeneous, the attractiveness of each firefly is related to its
brightness, and the brightness of a firefly is determined by an exponential
objective function. Each firefly emits light that attracts other fireflies. The
amount of light received depends on parameters such as distance and the
absorption coefficient of the surroundings. The longer the distance, the less
light is received. Likewise, in surroundings with a high light absorption
coefficient, such as foggy weather, the intensity of the light decreases. What
is certain is that every firefly, regardless of its gender, is attracted to and
moves toward a brighter firefly. Each firefly has a light intensity of its own,
and the key concept is that a firefly with low light intensity is always
attracted to a firefly with high light intensity.

This concept can be used for calculating similarity. A firefly based similarity
measure yields unique and distinguishable results, which is useful for ranking.
The algorithm deals naturally and efficiently with highly non-linear,
multi-modal optimization problems. It does not use velocities, so it avoids the
problems associated with velocity in particle swarm optimization (PSO). Its
speed of convergence is high, with a high probability of finding the global
optimum. It can be integrated flexibly with other optimization techniques to
form hybrid tools, and it does not require a good initial solution to start its
iteration process.

Each web page visited by user i is considered a firefly. The number of users
who visited a particular page is taken as the light intensity of the firefly.
The objective function is formulated from frequency and duration. Frequency is
calculated as the ratio of the number of visits to a page to the average number
of visits over all pages.
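
The dependence of received light on distance and absorption described above is
commonly written, in Yang's standard formulation of the Firefly algorithm, as
an attractiveness that decays exponentially with squared distance. The sketch
below shows that standard form only; the names `beta0`, `gamma` and `alpha`
follow the usual convention and are not taken from this text, and the code is
not the similarity measure of the proposed system.

```python
import math
import random


def attractiveness(distance, beta0=1.0, gamma=1.0):
    """Attractiveness seen at a given distance: beta0 * exp(-gamma * r^2)."""
    return beta0 * math.exp(-gamma * distance ** 2)


def move_towards(xi, xj, beta0=1.0, gamma=1.0, alpha=0.0, rng=random):
    """Move firefly xi toward a brighter firefly xj, plus a small random step."""
    r = math.dist(xi, xj)
    beta = attractiveness(r, beta0, gamma)
    return [a + beta * (b - a) + alpha * (rng.random() - 0.5)
            for a, b in zip(xi, xj)]
```

With `gamma` close to zero every firefly sees every other one at full
brightness, while a large `gamma` (the "foggy" case) limits attraction to near
neighbours.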

 

 

The duration is the ratio of the duration of a page to the total duration of
all the pages visited by the user. Thus, the objective function can be defined
as in equation (5.1):

$$\mathrm{Interest}(i) = \frac{2 \cdot \mathrm{Frequency}(i) \cdot \mathrm{Duration}(i)}{\mathrm{Frequency}(i) + \mathrm{Duration}(i)} \qquad (5.1)$$

The interest of all users in the cluster is calculated. The pages to be
recommended are then found by applying the page rank algorithm [2] to the
obtained result, and the resulting pages are recommended to the user.
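
As a worked illustration, the sketch below computes the interest score for each
page, reading equation 5.1 as the harmonic mean of frequency and duration (twice
their product divided by their sum). The function name and the dictionary
inputs `visits` and `durations` are assumptions made for this example.

```python
def interest_scores(visits, durations):
    """visits: page -> visit count; durations: page -> time spent (seconds)."""
    avg_visits = sum(visits.values()) / len(visits)
    total_duration = sum(durations.values())
    scores = {}
    for page, count in visits.items():
        # Frequency: visits to this page relative to the average over all pages
        freq = count / avg_visits
        # Duration: share of the user's total browsing time spent on this page
        dur = durations.get(page, 0.0) / total_duration if total_duration else 0.0
        scores[page] = 2 * freq * dur / (freq + dur) if freq + dur else 0.0
    return scores
```

For a user with 3 visits and 30 seconds on page a, and 1 visit and 10 seconds
on page b, the frequencies are 1.5 and 0.5 and the durations 0.75 and 0.25, so
the interest is 1.0 for page a and about 0.33 for page b.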

 

D. Ranking the Web Pages

 

The resulting set of web pages should be ranked so that the pages the user is
likely to find most interesting come first. They are therefore sorted by the
interest of the active user. The association rule checks the maximum possible
combinations, which provides more accurate pages.
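
The sorting step can be sketched as a simple top-k selection over interest
scores. This covers only the sort by interest; it does not implement the page
rank algorithm or the association rule check, and the function name and the
`scores` mapping are hypothetical.

```python
import heapq


def top_k_pages(scores, k):
    """Return the k pages with the highest interest score, best first."""
    return heapq.nlargest(k, scores, key=scores.get)
```

For example, `top_k_pages({"a": 1.0, "b": 0.33, "c": 0.6}, 2)` selects pages a
and c in that order.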

 

 

E. Recommendation Process

 

The URLs to be recommended are identified based on ranking and the similarity
measure. The similarity measure is calculated among the users by comparing
their similar interests. The page rank algorithm is then applied to the
resulting pages to rank those most relevant to the user, and the resulting URLs
are recommended. The recommended web pages are therefore more relevant to the
user. Naive Bayes clustering eliminates the outliers, and the Firefly based
similarity calculation checks all the subsets of the clusters.

 

 

   
