Large-scale SVD and subspace-based methods for information retrieval

O. Marques*, H. Simon*, and H. Zha

*NERSC, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720

307 Pond Laboratory, Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802-6103

Abstract. We present a theoretical foundation based on subspaces for latent semantic indexing (LSI) in information retrieval. We show that our model leads to a low-rank-plus-shift structure that is approximately satisfied by the cross-product of the term-document matrices. This structure can be exploited to compute the partial singular value decomposition (SVD) of the large sparse term-document matrices used in LSI. We also discuss several parallel implementation issues and present empirical numerical results on the Cray T3E using text collections with millions of documents.
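For reference, the low-rank-plus-shift structure mentioned in the abstract is commonly stated as follows; this is a sketch of the standard formulation, with the rank k and shift sigma introduced here as assumed symbols, and is not a verbatim restatement of the model developed in the paper.

\[
  A^{T} A \;=\; X + \sigma^{2} I,
  \qquad X = X^{T} \succeq 0, \quad \operatorname{rank}(X) = k \ll \min(m, n),
\]

where A denotes the m-by-n term-document matrix. Under such a structure, the dominant k-dimensional singular subspace of A that LSI works with can be recovered from the leading eigenvectors of the cross-product matrix, which is what makes the structure useful for computing a partial SVD of A.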