Paper’s Objective

The paper adapts a graph-based ranking model for extractive summarization of long scientific documents by exploiting discourse structure, i.e., sections in the case of scientific documents.

Problem Formulation

A document is represented as a graph in which each node is a sentence and the similarity between two sentences is the weight of the edge between them. Summary generation is formulated as a node selection problem, in which nodes (i.e., sentences) that are semantically similar to many other nodes are chosen for inclusion in the final summary. In other words, node importance is determined by defining a notion of centrality in the graph.

Central Idea

  • A simple unsupervised graph-based ranking model, combined with careful modeling of discourse information as an inductive bias, can be surprisingly effective at selecting important sentences from long scientific documents. (LexRank and PacSum are the unsupervised graph-based ranking models of choice.)

  • Important information typically occurs at the start and end of sections.

  • Most sentences across section boundaries are unlikely to interact significantly with each other. To implement this, a hierarchy is introduced in the graph: section-level representations are added as graph nodes in addition to sentence nodes (see the sketch after this list). Doing so converts the flat graph into a hierarchical, non-fully-connected graph, which has two advantages:

    • Reduced computational cost
    • Pruning of distracting weak edges
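As a rough illustration of this structure, here is a minimal sketch of how such a hierarchical document could be held in code (the `HierarchicalDoc` container and its field name are hypothetical, not from the paper):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HierarchicalDoc:
    """A document as a hierarchical graph.

    sections[k] holds the vector representations of the sentences in
    section k. Sentence-sentence edges are only allowed within a section,
    and section nodes connect the two levels, so the graph is hierarchical
    rather than fully connected.
    """
    sections: list[list[np.ndarray]]  # one list of sentence vectors per section
```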

Methodology

Let us represent a document as a graph \(G=(V,E)\), where \(V\) is the set of vertices representing sentences (or other textual units) in the document and \(E\) is the set of edges representing the similarity between sentences. The directed edge \(e_{ij}\) from node \(v_i\) to node \(v_j\) is typically weighted by \(w_{ij} = f(sim(v_i, v_j))\), where \(sim\) is a measure of similarity between two nodes.

The algorithms select the most salient sentences from \(V\) based on the assumption that sentences that are similar to a greater number of other sentences capture more important content and therefore are more informative.
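As a concrete sketch, the flat version of this graph can be built as below, assuming each sentence has already been embedded as a vector (the embedding step is outside this snippet and is an assumption, not the paper's prescribed model):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two node vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def build_flat_graph(sent_vecs: list[np.ndarray]) -> np.ndarray:
    """Dense |V| x |V| matrix of pairwise sentence similarities.

    Entry [j, i] corresponds to the edge e_ji; the discourse-aware
    asymmetric weighting f(sim) is applied on top of this later.
    """
    n = len(sent_vecs)
    sim = np.zeros((n, n))
    for j in range(n):
        for i in range(n):
            if i != j:
                sim[j, i] = cosine_sim(sent_vecs[j], sent_vecs[i])
    return sim
```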

To create the hierarchy, we allow two levels of connections in our hierarchical graph: intra-sectional connections and inter-sectional connections.

Intra-sectional connections aim to model the local importance of a sentence within its section. They implement the idea that a sentence that is similar to a greater number of other sentences in the same topic/section should be more important. This is realized as a fully-connected subgraph for an arbitrary section \(I\), where sentence-sentence edges are allowed for all sentence nodes within the same section.

Inter-sectional connections aim to model the global importance of a sentence with respect to other topics/sections in the document, as a sentence that is similar to a greater number of other topics is deemed more important.


The authors introduce section nodes on top of sentence nodes to form a hierarchical graph. For inter-sectional connections, only section-sentence edges are allowed, modeling the global information.

To calculate the weight of an edge, we first measure similarity between a sentence-sentence pair \(sim(v_j^{I} , v_i^{I} )\) and a section-sentence pair \(sim(v^J , v_i^I )\). While our method is agnostic to the measure of similarity, we use cosine similarity with different vector representations in our experiments, averaging a section’s sentence representations to obtain its own.
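A minimal sketch of these two similarity computations, reusing `cosine_sim` and the sentence vectors from the earlier snippet (helper names are illustrative):

```python
import numpy as np

def section_vector(sent_vecs: list[np.ndarray]) -> np.ndarray:
    """A section's representation: the average of its sentences' vectors."""
    return np.mean(np.stack(sent_vecs), axis=0)

# Sentence-sentence similarity: sim(v_j^I, v_i^I) = cosine_sim(s_j, s_i)
# Section-sentence similarity:  sim(v^J, v_i^I)  = cosine_sim(section_vector(section_J), s_i)
```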

While the similarities of two graph nodes are symmetric, one may be more salient than the other when considering their discourse structures.

Asymmetric edge weighting over sentences is based on the hypothesis that important sentences are near the boundaries (start or end) of a text (Baxendale, 1958). We reflect this hypothesis by defining a sentence boundary function \(d_b\) over sentences \(v_i^I\) in section \(I\) such that sentences closer to the section’s boundaries are more important:

\[d_b(v_i^I) = \min(x_i^I, \alpha(n^I - x_i^I))\]

where \(n^I\) is the number of sentences in section \(I\) and \(x_i^I\) is sentence \(i\)'s position in section \(I\). \(\alpha \in \mathbb{R}^+\) is a hyper-parameter that controls the relative importance of the start versus the end of a section or document.
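A direct transcription of the boundary function, assuming 1-based sentence positions (the position convention is an assumption; the equation itself is the paper's):

```python
def d_b(x: int, n: int, alpha: float = 1.0) -> float:
    """Boundary distance d_b(v_i^I) = min(x_i^I, alpha * (n^I - x_i^I)).

    x: position of the sentence in its section (assumed 1-based here).
    n: number of sentences in the section.
    Smaller values mean a sentence closer to the section boundary, which
    the asymmetric weighting below treats as more important.
    """
    return min(x, alpha * (n - x))
```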

We define the weight \(w_{ji}^I\) for intra-section edges (incoming edges for \(i\)) as:

\[w_{ji}^I= \begin{cases} \lambda_1 \cdot sim(v_j^{I}, v_i^{I}), & \text{if } d_b(v_i^{I}) \geq d_b(v_j^{I})\\ \lambda_2 \cdot sim(v_j^{I}, v_i^{I}), & \text{otherwise} \end{cases}\]

where \(\lambda_1 < \lambda_2\), such that an edge \(e_{ji}\) incident to \(i\) is weighted more if \(i\) is closer to the text boundary than \(j\). Edges with a weight below a certain threshold \(\beta\) can be pruned.
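A minimal sketch of this asymmetric weighting with the pruning threshold \(\beta\) folded in (treating pruned edges as weight zero is an implementation assumption):

```python
def intra_weight(sim_ji: float, db_i: float, db_j: float,
                 lambda1: float, lambda2: float, beta: float = 0.0) -> float:
    """Asymmetric intra-section weight w_ji^I for the edge from v_j to v_i.

    With lambda1 < lambda2, the incoming edge to i is boosted when i is
    closer to the section boundary than j (i.e., d_b(v_i) < d_b(v_j)).
    Edges whose weight falls below the threshold beta are pruned.
    """
    w = lambda1 * sim_ji if db_i >= db_j else lambda2 * sim_ji
    return w if w >= beta else 0.0
```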

Asymmetric edge weighting over sections follows similar equations for inter-section edges. The weight \(w_i^{JI}\) for an edge from section \(J\) to sentence \(i\) in section \(I\) is defined as:

\[w_i^{JI}= \begin{cases} \lambda_1 \cdot sim(v^{J}, v_i^{I}), & \text{if } d_b(v^{I}) \geq d_b(v^{J})\\ \lambda_2 \cdot sim(v^{J}, v_i^{I}), & \text{otherwise} \end{cases}\]

The same distance criterion is used for inter-section edges, substituting the position of the section in the document for the position of the sentence in the section.
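A matching sketch for the inter-section case; `d_b` is the boundary function above, now evaluated over section positions with \(n\) equal to the number of sections in the document:

```python
def inter_weight(sim_Ji: float, db_I: float, db_J: float,
                 lambda1: float, lambda2: float) -> float:
    """Asymmetric inter-section weight w_i^{JI} for the edge from section J
    to sentence v_i in section I, with db_I and db_J computed from the
    sections' positions in the document.
    """
    return lambda1 * sim_Ji if db_I >= db_J else lambda2 * sim_Ji
```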

Importance Calculations We compute the overall importance of sentence \(v_i^I\) as the weighted sum of its inter-section and intra-section centrality scores:

\[c(v_i^I) = \mu_1 \cdot c_{inter}(v_i^I) + c_{intra}(v_i^I)\]

\[c_{intra}(v_i^{I}) = \sum_{v_j^I\in I}\frac{w_{ji}^I}{|I|}\]

\[c_{inter}(v_i^{I}) = \sum_{v^J\in D}\frac{w_i^{JI}}{|D|}\]

where \(I\) is the set of sentences neighboring \(v_i^I\), \(D\) is the set of neighboring sections in the hierarchical document graph, and \(\mu_1\) is a weighting factor for inter-section centrality.
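Putting the pieces together for the sentences of a single section, a minimal sketch of the centrality computation (the matrix layout is an assumption for illustration):

```python
import numpy as np

def importance(intra_w: np.ndarray, inter_w: np.ndarray, mu1: float) -> np.ndarray:
    """c(v_i^I) = mu1 * c_inter(v_i^I) + c_intra(v_i^I) for all sentences in I.

    intra_w: |I| x |I| matrix with intra_w[j, i] = w_ji^I.
    inter_w: |D| x |I| matrix with inter_w[J, i] = w_i^{JI}.
    """
    c_intra = intra_w.sum(axis=0) / intra_w.shape[0]  # sum_j w_ji^I / |I|
    c_inter = inter_w.sum(axis=0) / inter_w.shape[0]  # sum_J w_i^{JI} / |D|
    return mu1 * c_inter + c_intra
```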

Summary Generation

Lastly, we generate a summary by greedily extracting the sentences with the highest importance scores until a predefined word limit \(L\) is passed. Most graph-based ranking algorithms recompute importance after each sentence is extracted in order to prevent content overlap. However, the asymmetric edge scoring functions naturally prevent redundancy, because similar sentences have different boundary positional scores. The method thus extracts diverse sentences without recomputing importance.
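A minimal sketch of this greedy extraction (counting words by whitespace split is an assumption):

```python
def generate_summary(sentences: list[str], scores: list[float], L: int) -> list[str]:
    """Greedily take sentences in order of descending importance until the
    predefined word limit L is passed; importance is never recomputed."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    summary, n_words = [], 0
    for i in ranked:
        summary.append(sentences[i])
        n_words += len(sentences[i].split())
        if n_words >= L:  # stop once the word limit is passed
            break
    return summary
```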

Results and Limitations

Cons

  1. The method is limited to long documents with sections, i.e., documents with a discourse structure. If there are no sections, we would have to replace them with some other abstraction, or fall back to the flat graph.

  2. Coherence will also be an issue: the method is extractive in nature and will not necessarily produce a coherent flow of sentences.

  3. Domain-specific nuances might not be captured by the method, and addressing this would require modifications. The method also does not allow for any kind of supervision, so guiding the algorithm toward domain-specific nuances would be difficult and might require further modeling.

Pros

  1. The method is unsupervised and can be applied to any domain without any need for annotated data.

  2. The method is extractive in nature, so the summary is faithful to the original text and its factual correctness remains intact.

  3. The method is competitive with state-of-the-art abstractive summarization methods (as of 2023, on the arXiv dataset) and would be a strong baseline for any long-document summarization task with discourse structure.

Implication

  1. Extractive summarization faithfully preserves the original text, ensuring the factual correctness of a summary to some extent (which is good for critical applications like medical, legal, etc.).
  2. As the method is unsupervised, it can be applied to any domain without any need for annotated data.