Text and Structural Data Mining of Influenza Mentions in Web and Social Media

1. Introduction

Influenza diagnosis based solely on the presentation of symptoms is limited as these symptoms may be associated with other diseases. Many cases of influenza remain undiagnosed. While the presence of influenza in an individual can be confirmed through specific diagnostic tests, the influenza prevalence in the population at any given time is unknown and can only be estimated. In the past, such estimates have relied solely on the extrapolation of diagnosed cases, making it difficult to identify the various phases of seasonal influenza or to identify a more serious manifestation of a flu epidemic.

Web and social media (WSM) provide a resource for detecting increases in influenza-like illness (ILI). This paper evaluates blog posts, a type of WSM, that discuss influenza; the analyses show a significant correlation with patient reporting of ILI during the US 2008–2009 influenza season. We briefly discuss the history of infectious disease outbreaks and recent approaches to online public health surveillance of influenza, as well as the value of social communities during an outbreak. We present a comprehensive analysis covering 24 weeks of data. We suggest a possible response that identifies influenza-related WSM communities sharing flu-related postings. Strongly connected communities are evaluated, and influential bloggers who should be part of a WSM outbreak response are identified. We leverage graph-based data mining to identify structural anomalies in the flu blogosphere that correspond to increases in ILI. We envision several applications that leverage automatic open source document analytics for biosurveillance, such as providing lagging indicators of a disease outbreak to a component of a US port and border biosurveillance system.

2. Data and Methods

Spinn3r is a WSM indexing service that conducts real-time indexing of all blogs, with a throughput of 100,000 new blogs indexed per hour. The metadata in this data set include the following: blog title, blog URL, post title, post URL, date posted, description, full HTML-encoded content, subject tags annotated by the author, and language. Data were selected from a period of 24 weeks, from 5 October 2008 to 21 March 2009. A majority of the articles we analyzed were weblogs; mainstream media account for 20% of the data, and the remaining types include forums and classified ads. Indexing, parsing, and link extraction code was written in Python. The compute resource used for this work is housed at the University of North Texas Center for Computational Epidemiology and Response Analysis.
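As a rough illustration of how posts with this metadata might be turned into a weekly count of flu-content posts, the sketch below filters posts by keyword and buckets them by week. The keyword list, the `(date, text)` post representation, and the function names are assumptions made for this example; the summary does not list the keywords or data structures the authors used. Weeks are assumed to start on Sunday, which matches the stated date range (5 October 2008 was a Sunday and 21 March 2009 a Saturday).

```python
import re
from collections import Counter
from datetime import date, timedelta

# Hypothetical keyword list; the source does not enumerate the keywords used.
FLU_KEYWORDS = ("influenza", "flu")

def is_flu_post(text, keywords=FLU_KEYWORDS):
    """Return True if any keyword appears as a whole word in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return any(k in words for k in keywords)

def week_start(day):
    """Map a date to the Sunday that begins its week."""
    # date.weekday() returns Monday=0 ... Sunday=6.
    return day - timedelta(days=(day.weekday() + 1) % 7)

def weekly_flu_counts(posts):
    """Count flu-content posts per week from (date_posted, text) pairs."""
    counts = Counter()
    for posted, text in posts:
        if is_flu_post(text):
            counts[week_start(posted)] += 1
    return dict(sorted(counts.items()))

# Toy posts, invented for illustration only.
sample_posts = [
    (date(2008, 10, 5), "Got my flu shot today"),
    (date(2008, 10, 7), "Influenza season is here"),
    (date(2008, 10, 8), "A new recipe for soup"),
    (date(2008, 10, 12), "Home sick with the flu"),
    (date(2008, 10, 13), "She is fluent in French"),
]
weekly = weekly_flu_counts(sample_posts)
```

Matching on whole words keeps terms such as "fluent" from being counted; a production filter would likely also need to handle HTML markup and multiple languages, both of which appear in the metadata described above.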

Trends in influenza-related WSM items can be monitored using the social media mining methodology. This methodology facilitates the identification of outbreaks and of increases in influenza infection in the population. We posit that a strong correlation exists between the frequency of FC (flu-content) posts per week and CDC (Centers for Disease Control and Prevention) ILI surveillance data. Qualitative assessment of category tags, the prevalence of FC-posts on a blog site, and persistent posting of flu-related posts also suggest ILI trends. We hypothesize that the frequency of blog-world flu posts correlates with patient reporting of ILI and with the US flu season. To verify this hypothesis, we compare our data to CDC surveillance reports from sentinel healthcare providers.
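One common way to quantify such a relationship between two weekly series is Pearson's correlation coefficient. The sketch below computes it from first principles; the function name and the sample numbers are illustrative assumptions and are not values from the study.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Placeholder weekly values, invented for illustration only.
fc_posts_per_week = [1, 2, 3, 4]
ili_reports_per_week = [2, 1, 4, 3]
r = pearson_r(fc_posts_per_week, ili_reports_per_week)
```

A value of r near 1 would indicate that weeks with more flu-content posts also tend to be weeks with more ILI reports; it does not by itself establish which series leads the other.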

WSM communities will play a vital role in any public health response to an outbreak. The influential bloggers could be first responders to a disease outbreak. The readers will trigger an information cascade, spreading public health communications (to vaccinate, quarantine). A WSM targeted response must be cost-effective and optimized to achieve maximum strategy penetration. Closeness and betweenness centrality measures and Google’s PageRank (eigenvector centrality) will rank influenza community blog sites in order to target key actors. The Girvan-Newman community finding algorithm will identify communities of interest.
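To make the ranking idea concrete, the following is a minimal sketch of PageRank computed by power iteration on a toy blog link graph. The damping factor of 0.85, the handling of blogs without out-links, and all blog names are assumptions for this example; the summary does not state the parameters or implementation the authors used.

```python
def pagerank(out_links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dict mapping node -> set of targets."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v, targets in out_links.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new[t] += share
        change = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if change < tol:
            break
    return rank

# Hypothetical blog link graph: three blogs link to a hub, which links back to one.
toy_links = {
    "blog_a": {"hub_blog"},
    "blog_b": {"hub_blog"},
    "blog_c": {"hub_blog"},
    "hub_blog": {"blog_a"},
}
ranks = pagerank(toy_links)
ranking = sorted(ranks, key=ranks.get, reverse=True)
```

In this toy graph the hub receives the most in-links and ends up ranked first, and the blog it links to inherits much of that rank, which is the behavior that makes such a measure useful for picking out key actors.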

Graph-based algorithms can be leveraged to identify communities and to facilitate bio-event detection by searching for anomalies in the link structure of WSM. We introduce a method for discovering substructures in structural databases, as implemented in Subdue. Subdue is designed for general-purpose automated discovery, concept learning, and hierarchical clustering, and the method can be applied to many structural domains. Subdue is leveraged in our analysis to identify non-obvious patterns in blog posts that may serve as lagging indicators of an influenza outbreak.

Formal study is needed to verify the accuracy of self-reported diagnoses and behaviors in WSM.

3. Results and Discussion

The CDC ILINet (Influenza-like Illness Surveillance Network) surveillance data and the FC-posts-per-week data are plotted for comparison. To test our hypothesis that a correlation exists between CDC ILINet reports and mined WSM FC-post frequency, Pearson’s correlation statistic is evaluated between the two data series. After close inspection of the data provided by Spinn3r, we identified a significant increase in blog coverage resulting from the success of their service and the subsequent expansion of their web crawlers.

Graph-based data mining discovered a substantial presence of MySpace blogs in the last three weeks of data. We manually inspected the blogs and discovered that many of the MySpace blogs were discussing the health of American Idol contestants, several of whom were sick with the flu.

A link graph was constructed from the blogger source URLs and the out-links of the influenza posts. Removing self-references and parallel out-links and taking the largest weak component produced an aggregate graph of 694,388 nodes (bloggers) and 3,529,362 directed edges (unique blogger-to-blogger links). To advance our approach, we target strongly connected components within our flu link graph for community identification. We cluster the second-largest strongly connected component, which consists of 2,306 blogs and 26,768 edges, with an average degree of 23. The Girvan-Newman community finding algorithm, which recursively removes the edge with the highest betweenness centrality, identifies 11 communities.

Detecting anomalies in various data sets is an important endeavor. We define an anomaly as a surprising or unusual occurrence. Statistical approaches have led to various successes, such as detecting computer and network intrusions.
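The link-graph construction and the strongly connected component step described above can be sketched as follows. This is a minimal illustration under stated assumptions: links arrive as `(source_blog, target_blog)` pairs, the function names are hypothetical, the weak-component step is omitted, and Kosaraju's algorithm is used here as one standard way to find strongly connected components, not necessarily the one the authors applied.

```python
def build_link_graph(links):
    """Build a directed graph from (source, target) pairs.

    Self-references are dropped, and storing targets in a set
    collapses parallel links between the same pair of blogs.
    """
    graph = {}
    for src, dst in links:
        if src == dst:
            continue
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    return graph

def strongly_connected_components(graph):
    """Kosaraju's algorithm, written iteratively to avoid deep recursion."""
    # Pass 1: record nodes in order of depth-first finishing time.
    visited, order = set(), []
    for start in graph:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbors = stack[-1]
            for nxt in neighbors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()
    # Pass 2: explore the reversed graph in decreasing finishing time.
    reverse = {v: set() for v in graph}
    for v, targets in graph.items():
        for t in targets:
            reverse[t].add(v)
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    stack.append(prev)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Toy links, invented for illustration: one duplicate and one self-reference.
toy_links = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "b"),
             ("c", "c"), ("c", "d"), ("d", "e"), ("e", "d")]
link_graph = build_link_graph(toy_links)
edge_count = sum(len(targets) for targets in link_graph.values())
sccs = strongly_connected_components(link_graph)
```

Here the eight raw links reduce to six unique directed edges, and the graph splits into two strongly connected components; `sccs[1]` would be the second-largest component, the kind of subgraph the text selects for clustering.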

4. Methods and Materials

Closeness centrality reflects how productive an actor can be in communicating information to other actors. It is defined in Equation 1. Betweenness centrality measures interpersonal influence (Equation 2). PageRank is an example of eigenvector centrality and measures the importance of a node by assuming that links from more central nodes contribute more to its ranking than links from less central nodes (Equation 3). We take an intuitive and simple definition of a WSM community and identify possible first-responder bloggers by link analysis. Blog ranking reinforces the idea that these communities can disseminate information as part of a broader public health response triggered by anomalies in ILINet and WSM surveillance. The general form of this community structure finding algorithm is enumerated. Subdue’s discovery algorithm and an example are described.
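As Girvan and Newman describe it, the general form of their community-finding algorithm is: compute the betweenness of every edge, remove the edge with the highest betweenness, recompute, and repeat. The sketch below performs one such split on a small graph. It assumes an undirected, unweighted graph without self-loops, given as a dict of neighbor sets, which is a simplification of the directed blog link graph; the function names and the toy graph are hypothetical.

```python
from collections import deque

def edge_betweenness(adj):
    """Brandes-style edge betweenness for an undirected, unweighted graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                credit = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += credit
                delta[v] += credit
    # Every path was counted once from each of its two endpoints.
    return {edge: value / 2.0 for edge, value in score.items()}

def connected_components(adj):
    """Return the connected components of an undirected graph as sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        comp, stack = {s}, [s]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    comp.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps

def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the component count rises."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    initial = len(connected_components(adj))
    while True:
        scores = edge_betweenness(adj)
        if not scores:
            return connected_components(adj)
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
        current = connected_components(adj)
        if len(current) > initial:
            return current

# Toy graph: two triangles joined by a single bridge edge (c, d).
toy_graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
bridge_score = edge_betweenness(toy_graph)[frozenset(("c", "d"))]
communities = girvan_newman_split(toy_graph)
```

The bridge carries every shortest path between the two triangles, so it has the highest betweenness and is removed first, leaving the two triangles as separate communities. Applying the split recursively to each resulting component yields a hierarchy of communities.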

5. Future Work

Future work will quantify the impact and validate the use of WSM to monitor seasonal influenza epidemics and global pandemics. Geo-location tagging is now implemented in blog, social network, and micro-blogging platforms, and future research will leverage this new data in the next-generation WSM biosurveillance system. Identifying the perspective of influenza keyword posts helps determine their contribution to disease surveillance.

6. Conclusions

A framework of complementary data-mining methods is suggested. We evaluate blog posts containing influenza topic keywords through text, link, and structural data mining. Results from the analysis show a strong co-occurrence of flu blog posts with the US 2008–2009 flu season. The frequency of flu posts per blogger follows a heavy-tailed distribution. We show through graph metrics that the most prolific bloggers are not the most influential.

The Girvan-Newman algorithm is leveraged to identify clusters of similar sites as potential target communities for online health information campaigns. The results show distinct WSM communities clustered by publisher and content type. We apply a graph-based data mining technique, Subdue, to detect anomalies and informative substructures among flu blogs connected by publisher type, links, and user-tags. Graph-based data mining can identify significant anomalies in flu blogs that were not identified through text analysis and can be further investigated by an analyst.

Last updated: February 7, 2025
Released: November 24, 2020
