athenoc's Journal

sirius7

input S = {S1, S2, …, SJ}   // set of J seed words
input F = {F1, F2, …, FK}   // set of K features
input m                     // number of iterations
input e                     // number of entities to be added in every iteration
input c                     // number of top features to keep
output S                    // set of entities (the lexicon)

L1 = feature_counts(false);   // L1 is a hash: featureInstance -> count over all phrases
i = 0;

while i < m do
  L2 = feature_counts(true);  // counts over phrases already in the lexicon
  foreach featureInstance in L2 do
     c1 = L1[featureInstance];
     c2 = L2[featureInstance];
     s = score(c1, c2);
     Add (featureInstance, s) to L;

  L = select top c features from L;
  E = list of candidate entities, sorted by score (initially empty);
  P = all phrases which are potential named entity instances;

  for each p in P do
     count = 0;
     if p belongs to S then continue;   // add only new instances
     score = 0;

     for each f in F do
        featureInstance = apply f on p;
        if featureInstance belongs to L then
           score += L[featureInstance]; count++;

     if count > 0 then Add (p, score/count) to E;

  Select top e entities from E and add them to S; i++;

function feature_counts(flag)
  L = hash: featureInstance -> count

  P = all phrases which are potential named entity instances;
  for each p in P do
      for each f in F do
         featureInstance = apply f on p;
         if flag == false OR p belongs to S then
            Add featureInstance to L and increment its count;

  return(L);

basilisk

Named Entity Recognition (NER) is an important part of information extraction from text. One of the methods used for NER requires a gazette of entities.

 

Basilisk (Bootstrapping Approach to Semantic Lexicon Induction using Semantic Knowledge) is a weakly supervised bootstrapping algorithm that automatically generates a semantic lexicon (Thelen and Riloff, 2002).

    The input is a set of seed words and a corpus. Seed words are obtained by sorting the words of the corpus by frequency and manually selecting those that best define the gazette. The selection of seed words is important, and a poor choice can be detrimental to the whole process, since Basilisk relies on extraction patterns. Initially, the semantic lexicon consists only of the seed words.
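A rough sketch of how such a frequency list can be produced (the file name corpus.txt, the cutoff of 50, and the restriction to capitalized words are my own placeholders, chosen because I am after company names):

# Count capitalized words in the corpus and print the most frequent ones,
# so that seed words can be picked by hand
my %freq;
open(my $fh, '<', 'corpus.txt') or die "corpus.txt: $!";
while (my $line = <$fh>) {
    $freq{$_}++ for $line =~ /\b[A-Z][a-z]+\b/g;
}
close $fh;
my @top = sort { $freq{$b} <=> $freq{$a} } keys %freq;
print "$_\t$freq{$_}\n" for @top[0 .. ($#top < 49 ? $#top : 49)];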

 

    We start each iteration with the ‘m’ best extraction patterns, scored against the current semantic lexicon. These patterns are applied to the corpus, and the candidate entities they extract are collected in the candidate word pool. The best ‘n’ candidate words are then added to the semantic lexicon.

 

    Extraction patterns are scored using RlogF (Riloff, 1996):

RlogF(pattern_i) = (F_i / N_i) * log2(F_i)

Where,
  F_i is the number of extracted entities that are already in the lexicon
  N_i is the total number of entities extracted by pattern_i
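For example, a pattern that extracts 10 entities of which 4 are already in the lexicon scores (4/10) * log2(4) = 0.8, while a pattern that extracts only those 4 lexicon entries and nothing else scores (4/4) * log2(4) = 2. High-precision patterns are thus preferred, with a mild bias towards more frequent ones.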

 

The top ‘m’ patterns are added to the pattern pool. If the pool size stayed fixed, the same patterns would keep being selected after a few iterations, so the pool is grown by one pattern on every iteration (Thelen and Riloff use the top 20 + i patterns on iteration i). This ensures that at least one new pattern contributes to the candidate pool on every iteration.

 

The candidate words are scored by averaging, over all patterns that extract them, the log of the number of lexicon entries each of those patterns has extracted:

AvgLog(word_i) = ( sum over j = 1..P_i of log2(F_j + 1) ) / P_i

Where,
 P_i is the number of patterns that extract word_i.
 F_j is the number of lexicon entries extracted by pattern j.
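For example, a candidate extracted by two patterns that have already extracted 3 and 7 lexicon entries respectively scores (log2(3+1) + log2(7+1)) / 2 = (2 + 3) / 2 = 2.5, while a candidate backed by a single pattern with one lexicon entry scores only log2(2) = 1. Words supported by several reliable patterns therefore rise to the top.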

 

After every iteration, the ‘n’ best candidates are added to the lexicon. The pattern pool and the candidate pool are then emptied, and the process starts over.

 

 

 

 

I wrote a simple Perl tool that runs Basilisk on a single document. Extending it to a corpus is straightforward.

To create a gazette of names of corporations, my seed was

 

@entityS=("Nokia ","Apple ","Microsoft ","Google ");
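The membership test in the pseudocode (p belongs to S) is just a hash lookup; a minimal sketch, where the hash and helper names are my own shorthand rather than part of the original tool:

# Build the lexicon hash from the seed list; trailing spaces are trimmed for lookups
my %lexicon = map { (my $w = $_) =~ s/\s+$//; ($w => 1) } @entityS;

sub in_lexicon {
    my ($phrase) = @_;
    $phrase =~ s/\s+$//;
    return exists $lexicon{$phrase};
}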

 

Features (extraction patterns) are ‘context rules’ which are applied to sentences to extract potential entities (Pawar, Srivastava, and Palshikar, 2010).

For example, the feature context_previous_next, when applied to the sentence ‘I work for Nokia since 1998’, extracts context_for_since. The potential entity (Nokia in this case) is added to the candidate word pool, and the extracted pattern (context_for_since) is added to the pattern pool. Whether these are finally selected depends on their scores.

I used a simple feature list,

 

@featureList= ("CONTEXT_PREV_NEXT","FW_WORD","LW_WORD");
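Only the context rule is shown in detail below. My reading is that FW_WORD and LW_WORD build feature instances from the first and the last word of the candidate phrase; here is a sketch under that assumption (the sub name applyWordRule is hypothetical):

# Sketch only: assumes FW_WORD / LW_WORD stand for first-word / last-word features
sub applyWordRule {
    my ($phrase, $rule) = @_;                # $rule is "FW_WORD" or "LW_WORD"
    my @words = split " ", $phrase;
    return '-1' unless @words;
    my $w = ($rule eq "FW_WORD") ? $words[0] : $words[-1];
    (my $featureIns = $rule) =~ s/WORD/$w/;  # e.g. "FW_Nokia" or "LW_Networks"
    return $featureIns;
}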

 

The algorithm (pseudocode at the top of this post) is based on Pawar, Srivastava, and Palshikar (2010).

score(c1, c2) in Perl looks like:

sub score {
    my ($c1, $c2) = @_;
    # RlogF-style score: (c2/c1) * log(c2); 0 when either count is missing
    return 0 if !$c1 || !$c2;
    return ($c2 / $c1) * log($c2);
}

Since I am looking for companies, potential phrases are essentially runs of capitalized words (phrases that do not fulfill this requirement can safely be ignored). I used a regular expression to find such phrases:

sub returnPhrase {

    # Return the first run of two or more capitalized words, or '-1' if there is none
    if ( $_[0] =~ m/([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)/ ) {
        return $&;
    }

    return '-1';
}
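For instance (the sentence is invented for illustration):

print returnPhrase("He joined Goldman Sachs in 2001"), "\n";   # prints "Goldman Sachs"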

Procedure that applies the context_previous_next rule to a sentence, returning the potential phrase and the feature instance:

sub applyContextRule {

    # Build a CONTEXT_PREV_NEXT feature instance from the words immediately
    # before and after the first capitalized phrase in the sentence.
    if ( $_[0] =~ m/([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)/ ) {

        my $actualPhrase = $&;                        # the capitalized phrase itself
        my ($before, $after) = ($`, $');              # text before / after the phrase
        my $prev = (split " ", $before)[-1] // '';    # word immediately preceding it
        my $next = (split " ", $after)[0]  // '';     # word immediately following it

        my $featureIns = $featureList[0];             # "CONTEXT_PREV_NEXT"
        $featureIns =~ s/PREV/$prev/;
        $featureIns =~ s/NEXT/$next/;

        return ($actualPhrase, $featureIns);
    }

    return ('-1', '-1');
}
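For example, on a sentence with a two-word company name (the regular expression above needs at least two capitalized words in a row; the sentence is made up for illustration):

my ($phrase, $featureIns) = applyContextRule("I work for Nokia Networks since 1998");
# $phrase     is "Nokia Networks"
# $featureIns is "CONTEXT_for_since"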

Basilisk works very well, and the gazette is complete within a few iterations.

 

The next step is NER. Pawar, Srivastava, and Palshikar (2010) proposed a fast algorithm which significantly outperforms a naïve approach based on regular expressions. I have a slightly different take on it. I will post the complete implementation (in Perl) and results soon!