Thursday, May 1, 2014

HTML en un arbol

El parseo de paginas web con las librerias htmlparse y html2tree, usa las "rutas" a los puntos de interés dentro de la estructura jerárquica del archivo html.
En el wiki de Tcl, descubrí un código ejemplo llamado "LemonTree" el cual generaliza el llenado de un control BWidget tipo arbol con elementos de una jerarquía.
Ejemplos de estructuras de datos jerárquicos son: el sistema de archivos, los controles en una interface gráfica hecha con Tk, los comandos y las variables en los "namespaces",etc.

Por otro lado, LemonTree explica claramente como agregar otras estructuras jerárquicas a este control, lo cual viene perfecto para el caso de los tags dentro de un archivo html.
Una vez que se puede explorar el contenido del archivo html en un control de este tipo, se está en la posibilidad de utilizar este script a manera de herramienta para facilitar la determinacion de la ruta a los elementos de interés.
En esta ocasión, les presento el código al cual le llame' "LemonTree Branch" en la publicación original, para hacer referencia al script en el que esta basado.
Mañana, veremos un ejemplo de parseo de una pagina web, para lo cual, utilicé este script para encontrar la mencionada "ruta" el punto de inicio de los elementos que me interesa extraer.

Como se usa:
1) Si se tiene el archivo html almacenado de manera local.
a) Navegar a la ubicación del archivo haciendo click en el botón "Browse..."
b) Hacer click en el botón "html > tree"
c) Empezar la exploración del árbol haciendo click en los nodos

2) Si el archivo es una pagina web estándar.
a) Copié la dirección (URL) desde el navegador web a la caja de texto
b) Hacer click en el botón "html > tree"
c) Empezar la exploración del árbol haciendo click en los nodos

En la consola, se puede observar la secuencia de ruta, desde el punto seleccionado hasta la raíz. Vale la pena aclarar, que dicha secuencia, normalmente se requiere en el sentido contrario, o sea, desde la raíz, hasta el punto deseado.



package require struct
package require csv
package require report
package require htmlparse
package require textutil
package require http
package require BWidget
namespace eval LemonTree {variable uniqueID 0}

#http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html

if 0 {The Tree widget allows a -data item for each node, which I use for a dict-like
list that contains the node's type and "real name" (as opposed to the "display name"
- for instance, a dir node would display only its [file tail], but the real name is
the full path). This routine adds a node to the LemonTree: }

 proc LemonTree::add {w parent type name {text ""}} {
    variable uniqueID; variable icon
    if {$parent != "root"} {
      set val1 [::t get $name data]
      set val2 [::t get $name type]
      set val3 [::t index $name]
      if {$text eq ""} {set text "$val3: <$val2> [string range $val1 0 30]"}
      #if {$text eq ""} {set text "$val3,$val2,$val1"}
    }
    set id n[incr uniqueID]
    set data [list type $type name $name]
    #tk_messageBox -message "$type,$name"
    set fill [expr {[string match (* $text]? "blue": "black"}]
    set drawcross [expr {[info proc ::LemonTree::kids($type)] eq ""?
         "never": "allways"}]
    $w insert end $parent $id -text $text -data $data -drawcross $drawcross -fill $fill
    if [info exists icon($type)] {
             $w itemconfigure $id -image $icon($type)
    }
 }
if 0 {For speed, a Tree isn't drawn fully expanded at the beginning.
Instead, nodes are opened on demand, when the user clicks on the [+] icon.
I use the -drawcross "allways" mode (shudder - should be fixed to "always",
but then older code might break) to indicate that the node hasn't been opened before
- after the first opening, the mode is set to "auto", meaning to draw a cross only if the node has children. }

proc LemonTree::open {w node} {
    if {[$w itemcget $node -drawcross] eq "allways"} {
        set data [$w itemcget $node -data]
        set type [dict'get $data type]
        foreach {ktype kids} [kids($type) $w $node] {
            foreach kid $kids {add $w $node $ktype $kid}
        }
        $w itemconfigure $node -drawcross auto
    }
 }

if 0 {So far for the generic LemonTree - the rest is already customization for specific item types.
The kids($type) call above looks like an array element 
- in fact it's a way of dispatching the generic operation of providing the list of children
of an entity of given type, which of course depends on the type. For instance, the children
of a directory are its subdirectories, and then its files (with special-casing for Windows,
so that drive letters are the children of "/"): }

 proc LemonTree::kids(html) {w node} {
    set name [dict'get [$w itemcget $node -data] name]
    list html [::t children $name]
 }

if 0 {A Tree looks prettier if nodes have icons, so I'm using some of those that BWidget comes with:}
set path $BWIDGET::LIBRARY/images

foreach {type name} {dir folder file file array copy html folder} {
   set LemonTree::icon($type) [image create photo -file $path/$name.gif]
}

if 0 {This thing is more useful if you can get more information about an item by clicking on it
- for a file, its size and date; for a variable, its value; for a proc, its full specification, etc.
As a small first shot, I selected a "balloon" for that purpose. }

proc LemonTree::Info {w node} {
    set type [dict'get [$w itemcget $node -data] type]
    if {[info proc ::LemonTree::info($type)] ne ""} {
        balloon $w [info($type) $w $node]
    }
 }
 
#-- type-specific info providers:

 proc LemonTree::info(html) {w node} {
    #puts $node
    set name [dict'get [$w itemcget $node -data] name]
    if {$name != "root"} {
      set val1 [::t get $name data]
      set val2 [::t get $name type]
      set val3 [::t index $name]
      puts $node
      puts "\t[t index $name]"
      foreach nodo [t ancestors $name] {
       if {$nodo != "root"} {
         puts "\t[t index $nodo]"
       }
      }
    } else {
     set val1 "root"
    }
    return "$val1"
 }
 
#-- A simple ballon, modified from Bag of Tk algorithms:  
proc balloon {w text} {
    set top .balloon
    catch {destroy $top}
    toplevel $top -bd 1
    pack [message $top.txt -aspect 10000 -bg lightyellow \
        -borderwidth 0 -text $text -font {Helvetica 9}]
    wm overrideredirect $top 1
    wm geometry $top +[winfo pointerx $w]+[winfo pointery $w]
    bind  $top <1> [list destroy $top]
    raise $top
 }

if 0 {From Tcl 8.5, one would use a real dict, but it's easy to make a replacement
that works roughly the same in 8.4 (it returns "" for non- existing keys instead of throwing an error),
and might be slower, but I won't notice on dicts with two elements ;-}

proc dict'get {dict key} {
    foreach {k value} $dict {if {$k eq $key} {return $value}}
 }
#-- reconstruct a proc's definition as a string:
proc procinfo name {
    set args ""
    foreach arg [info args $name] {
        if [info default $name $arg def] {lappend arg $def}
        lappend args $arg
    }
    return "proc $name {$args} {[info body $name]}"
 }
 
#
#
#
proc main {} {
  set url [.txt get]
  catch {t destroy}
  .t delete [.t nodes root]

  if {$url == ""} {
  tk_messageBox -message "specify html location"
  return
  }
  ::struct::tree t
  if {[string range $url 0 3] == "http"} {
    set http  [::http::geturl $url]
    set html  [::http::data $http]
  } else {
    set html [read [set fh [open $url]]]
    close $fh
  }
  
  #puts $url
  htmlparse::2tree $html t
  #htmlparse::removeVisualFluff t
  #htmlparse::removeFormDefs t
  LemonTree::add .t root html    root  "(html)"
  
return
}
 
# -------------------------------------------
label .lbl -text "URL or path:"
entry .txt -width 60
button .btnBrowser -text Browse... -command {.txt insert end [tk_getOpenFile]}
button .btnGet -text "html > tree" -command main
button .btnSalir -text Exit -command {
catch {t destroy}
exit
}
grid .lbl .txt .btnBrowser
grid .btnGet -row 1 -column 1
grid .btnSalir -row 1 -column 2

#-- Now to demonstrate and test the whole thing: 
Tree .t -background white -opencmd {LemonTree::open .t} \
-width 60 -height 30 -yscrollcommand {.y set}
 .t bindText  <1> {LemonTree::Info .t}
 .t bindImage <1> {LemonTree::Info .t}
 
scrollbar .y -command {.t yview}

grid .t -row 2 -column 0 -columnspan 3 
grid .y -row 2 -column 3 -sticky ns
.txt insert end "http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html"
#-- Little development helpers:
bind . <Escape> {exec wish $argv0 &; exit}
bind . <F1> {console show}

No comments: