Web Pages, Text Types, and Linguistic Features: Some Issues

From a textual point of view, the web is a huge reservoir of documents. On the web, virtually everything can be seen as a ‘document’ or, better, a ‘web page’. The sheer amount of text available is overwhelming. Furthermore, the web is largely wild and uncontrolled. This becomes clear if we compare a ‘tamed’ resource of the paper world, like the British National Library, with the ‘untamed’ English Web. In this empirical study, I investigated text typologies in a random sample of raw web pages rather than in a corpus of pre-selected and pre-processed documents. I realized that the textuality of web pages might differ from the textuality of linear documents (whether paper or electronic). This new textuality makes automatic feature extraction and the application of NLP tools more troublesome. I also realized that the text typologies already available in the literature might not cover all web page types. The issues pointed out in this study do not have an easy solution. For the time being, my suggestion is to keep them in mind when assessing results from any automatic approach to web pages.