Latest Articles in this Channel:

  • 05/24/11--05:47: Comment on Wegman and the Ankle-Biters by RomanM (chan 2237999)
  • Nick, the “difference” comes from reading two separate articles by the same author, Dan Vergano. In the <a href="http://www.usatoday.com/weather/climate/globalwarming/2011-05-15-climate-study-plagiarism-Wegman_n.htm" rel="nofollow">one dated May 15th</a>, he writes. <blockquote>Wegman blamed a student who "had basically copied and pasted" from others' work into the 2006 congressional report, and said the text was lifted without acknowledgment and used in the journal study. "We would never knowingly publish plagiarized material" wrote Wegman, a former CSDA journal editor.</blockquote> I believe that Carrick would have been referring to this statement. You will notice that the words “into the congressional report” are not in quotation marks in the article so they appear to have been paraphrased by Vergano. However, you are referring to the article that <a href="http://content.usatoday.com/communities/sciencefair/post/2011/05/retracted-climate-critics-study-panned-by-expert-/1" rel="nofollow">he wrote on May 16th</a> (and updated the following day). The full quote from the email from Ms. Reeves clearly states that she contributed a draft document to Prof. Wegman’s request which he later included in the report: <blockquote>I was Dr. Wegman's graduate student when I provided him with the overview of social network analysis, at his request. My draft overview was later incorporated by Dr. Wegman and his coauthors into the 2006 report. I was not an author of the report. The format of the 2006 report involved a limited amount of citations. <em>The social network material that I provided to Dr. Wegman followed the format of the report.</em> (emphasis hers)</blockquote> Her justification for excluding references was that the material “followed the format of the report” and in this she appears to be partly correct. 
From my own experience, requisitioned <em>reports</em> outside of academic scholarly work (whether written by academics or not) do not usually provide line-by-line referencing. If it is important to relate a specific statement to its source, then an exception would be made. Otherwise, all of the references would be given either at the end of each section or in a single location within the report document. However, the problem appears to be that she may have neglected to give <em>any</em> references at all, leading Wegman to believe that the “overview” was completely her own work. You will notice that Prof. Wegman states that “We would never knowingly publish plagiarized material”. As far as “author status” goes, in my experience there is no reason (ethical or otherwise) that everyone contributing to a <em>report</em> need necessarily be acknowledged as a contributor. However, as Carrick states in his comment, this traditionally does not carry over to published academic work where the ownership of intellectual content plays a much more central role.

  • 05/24/11--07:14: Comment on Wegman and the Ankle-Biters by Nick Stokes (chan 2237999)
  • I think the difference is classic he said...she said... She says that she didn't write part of the report - she wrote an overview which others included in the report. She describes her version as a draft. It seems unclear whether she wrote it as something that she expected to be included in the report. The first para suggests no. The reference to the format suggests yes, but isn't conclusive.

  • 05/24/11--08:14: Comment on Wegman and the Ankle-Biters by RomanM (chan 2237999)
  • Nick, your hair-splitting obfuscation is incredible. She states that she is not an <em>author</em>. That is a well-defined term meaning that her name is not on the report listing her as a person who is responsible for the end product. It does not in any way suggest that she had no expectation that the material would be included in the final document. Look at the facts: She produced her portion in response to someone's request - it was not just "volunteered" content. In your own words, "She describes her version as a draft" - for what purpose would changes be made later to that material if NOT with the expectation that it would be used in print form? She specifically follows the format which is used in the report - a format she is well aware of <em>before</em> she hands in the draft. Yet, you can still say with a straight face: <em>It seems unclear whether she wrote it as something that she expected to be included in the report.</em> !!! Appropriate words elude me ...

  • 05/24/11--09:50: Comment on Wegman and the Ankle-Biters by Nick Stokes (chan 2237999)
  • Roman, Yes, it is unclear in the first para. She says the draft was <i>later</i> incorporated. The fact that she spelt that out suggests to me that she didn't write it with that destination in mind. And the fact that she emphasises that she wasn't an author reinforces that. But in any case, as Carrick said, placing the blame on someone you weren't prepared to credit with authorship is pretty feeble.

  • 05/24/11--10:45: Comment on Wegman and the Ankle-Biters by RomanM (chan 2237999)
  • Nick, it certainly could NOT have been incorporated <em>earlier</em>. Here, the word "later" is an uninformative <em>filler</em> word, not a word intended to stress some specific aspect of the situation (as, for example, <em>much later</em> might do). The rest of her statement just doesn't make sense if the draft had not been intended for inclusion in the report. Equus est mortuus...

  • 05/24/11--12:02: Comment on Wegman and the Ankle-Biters by Nick Stokes (chan 2237999)
  • <i>"uninformative filler word "</i> How do you know? I think she meant it.

  • 07/02/11--19:53: Comment on Spreading the Warmth Around by RegEM Impact on Peninsula Correlations « Climate Audit (chan 2237999)
  • [...] Pole and how its temperatures correlate with the rest of the grid points. These can be found at my statpad site . The R script can be found in a Word document here. This entry was written by RomanM, [...]

  • 07/23/11--08:06: Comment on Combining Stations (Plan C) by Roman M’s anomaly combination incorporated into R « the Air Vent (chan 2237999)
  • [...] long time readers know, I’m a fan of Roman’s temperature combination method which doesn’t require a base period window to offset individual station anomalies in global [...]

  • 10/17/11--10:08: Comment on Comparing Single and Monthly Offsets by The Blackboard » Another land temp reconstruction joins the fray (chan 2237999)
  • [...] method is similar to that of Nick Stokes and Jeff Id/Roman M, in that they all used the Tamino and Roman method of computing a monthly offset for each station such that the sum of the squared differences [...]

  • 11/11/11--11:39: Comment on EIV/TLS Regression – Why Use It? by Hu McCulloch (chan 2237999)
  • Here is some discussion of TLS from the thread "Un-Muddying the Waters" (11/7/11) over on CA. It's more OT over here. <blockquote> Hu McCulloch Posted Nov 10, 2011 at 3:55 PM | Permalink | Reply Roman -- I can’t say that I have ever actually used TLS, but from reading up some on it after discussions here on CA, it sounds to me like a very reasonable way to handle the errors-in-variables problem. However, the elementary treatments I have seen just assume that the variances of all the errors are equal. In fact, they ordinarily wouldn’t be equal, and in order for the method to be identified, you have to know what their relative size is (or what the absolute size is on one side). Then you can rescale the variables so that the errors are equal, and use the elementary method. This gives you “y on x” in the limit when you know that x has no measurement error, and “x on y” in the limit where you know that x has measurement error but there are no regression errors. Is there a standard way to compute standard errors or CI’s for the coefficients in TLS? I haven’t seen that. (Since the measurement-error-only case corresponds to the calibration problem, the CI’s may be nonstandard). Another issue that is glossed over is the intercept term -- most regressions include an intercept, which is the coefficient on a unit “regressor”. However, there is never any measurement error on unity, so it has to be handled differently. A quick fix-up is just to subtract the means from all variables, which forces the regression through the origin, and then to shift it to pass through the variable means instead. But then we still need an estimate of the uncertainty of the restored intercept term. The widely used econometrics package EViews doesn’t seem to have TLS. An Ivo Petras has contributed a TLS package to Matlab File Exchange. It may work, but as such it has no Matlab endorsement. 
(I recently found a very helpful program to solve Nash Equilibria there, but it only worked after I corrected two bugs.) Do you approve of TLS? RomanM Posted Nov. 10, 2011 at 5:41 PM | Permalink | Reply Hu, basically you have it right. The usual approach is to center all of the variables involved at zero (the line can be moved by de-centering the result later) and, as Steve says, to apply an svd procedure to estimate the coefficients. In the case of two variables, there is an explicit solution. Inference on the result would be difficult and resampling or asymptotic large sample results would likely be the only type of inferential methodology available. I have not bothered to research what these methods might be. I have some real concerns that the methods don’t really properly take into account the individual “errors” for the predictors or the responses. For each observation, there is in effect a single “residual value” which is apportioned to each of the variables (in exactly the same proportions for each observation) where the apportioning is determined by the fitted line (or plane as the case may be). I wrote up some criticism of the procedure at my blog a year ago. If you want a simple R function to do TLS, you can try this:
[sourcecode]
reg.orth = function(ymat, xmat) {
  # number of predictor and response columns
  nx = NCOL(xmat); ny = NCOL(ymat)
  # SVD of the combined (predictors, responses) matrix
  tmat = cbind(xmat,ymat)
  mat.svd = svd(tmat)
  # coefficients from the right singular vectors of the smallest singular values
  coes = -mat.svd$v[1:nx,(nx+1):(nx+ny)] %*% solve(mat.svd$v[(nx+1):(nx+ny),(nx+1):(nx+ny)])
  pred = xmat %*% coes
  list(coefs=coes,pred=pred)}
[/sourcecode]
ymat and xmat are matrices containing the response variables and the predictor variables, respectively. They should be centered (mean-subtracted) before use. The output is the regression coefficients and the predicted values for ymat. The residuals can be calculated by subtracting the latter from ymat. In the case of a single variable for each, the coefficient is the slope and the predicted values form a straight line. Do I “approve” of the procedure? 
I guess there are probably some uses, but IMHO, I don’t see that it really handles the problem that all of the variables can contain uncertainty in any reasonable fashion. Hu McCulloch Posted Nov 11, 2011 at 11:49 AM | Permalink | Reply Thanks for the link to your blog page, Roman. Since this is getting OT here, we should move this discussion there. But as you note in one of your comments on your page, the variables must be scaled so that their errors have equal variance before running standard TLS. If this is not done, measuring temperature in F rather than C changes the results, and equalizing the variances of the variables themselves can give absurd results. More over there later… </blockquote>
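The unit-dependence Hu mentions can be checked with a small simulation. The following is only a sketch on made-up data (the variable names, the true slope of 2 and the error standard deviations are all assumptions, not anything from the thread). It uses the same single-predictor SVD slope as `reg.orth` and compares what happens when the predictor is re-expressed in Fahrenheit.

```r
# Hypothetical illustration: the TLS slope is not equivariant to a change of units.
set.seed(1)
n <- 200
temp_c <- rnorm(n, mean = 10, sd = 5)   # unobserved "true" temperature, deg C (made up)
x_c <- temp_c + rnorm(n, sd = 2)        # predictor observed with measurement error
y   <- 2 * temp_c + rnorm(n, sd = 2)    # response with an error of equal variance

# Orthogonal (TLS) slope for one predictor and one response: centre both,
# then use the right singular vector belonging to the smallest singular value.
tls_slope <- function(x, y) {
  v <- svd(cbind(x - mean(x), y - mean(y)))$v
  -v[1, 2] / v[2, 2]
}

x_f <- 32 + 1.8 * x_c                   # the same readings expressed in deg F

b_c <- tls_slope(x_c, y)                # TLS slope per deg C
b_f <- tls_slope(x_f, y)                # TLS slope per deg F

ols_c <- unname(coef(lm(y ~ x_c))[2])
ols_f <- unname(coef(lm(y ~ x_f))[2])

# A unit-equivariant estimator satisfies slope_F * 1.8 == slope_C.
# OLS does (up to rounding); unscaled TLS generally does not.
c(ols_gap = ols_f * 1.8 - ols_c, tls_gap = b_f * 1.8 - b_c)
```

In this setup the errors have equal variance in the Celsius scaling, so that is the scaling in which plain TLS is appropriate; the Fahrenheit fit silently assumes a different error ratio.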

  • 11/30/11--22:47: Comment on GHCN and Adjustment Trends by P. Solar (chan 2237999)
  • Using the word sourcecode in square brackets should allow posting this sort of code. Here goes:
[sourcecode]
#####################
# read data from two files which have been downloaded from
# http://www1.ncdc.noaa.gov/pub/data/ghcn/v2/
# and decompressed by an external program
#v2.mean.Z
#v2.mean.adj.Z
v2.mean = readLines("v2.mean")
v2.madj = readLines("v2.mean_adj")
length(v2.mean) # 595759
length(v2.madj) # 422373

#last ten lines of adjusted file are identical and contain no information
#remove 9 of them
v2.madj = v2.madj[1:422364]

#identify matching station and year lines in both sets
#extract identifying info
idv2 = substr(v2.mean,1,16)
idv2adj = substr(v2.madj,1,16)

#check to see if both sets are in alphabetical order
#if so the pairing process is faster
sum(idv2[-length(idv2)] > idv2[-1]) #0
sum(idv2adj[-length(idv2adj)] > idv2adj[-1]) #0

#function to pair lines (both inputs must be sorted)
reconcile = function(dat1,dat2) {
  leng1 = length(dat1)
  leng2 = length(dat2)
  id.pos = rep(NA, leng2)
  curr = 1
  for (i in 1:leng2) {
    j = curr
    #guard against running past the end of dat1
    while (j <= leng1 && dat2[i] >= dat1[j]) {j=j+1}
    #guard against j-1 = 0 when dat2[i] sorts before every remaining dat1 entry
    if (j > 1 && dat2[i]==dat1[j-1]) {
      id.pos[i]=j-1
      curr = j}}
  id.pos }

inds = reconcile(idv2,idv2adj)

#check to see if there are adjusted lines without originals in the raw data
#remove if necessary
sum(is.na(inds)) #31
v2.madjx = v2.madj[-which(is.na(inds))]
indsx = inds[-which(is.na(inds))]
v2.meanx = v2.mean[indsx]
idv2x = idv2[indsx]
idv2adjx = idv2adj[-which(is.na(inds))]
identical(idv2x,idv2adjx) # TRUE

#function to calculate individual monthly differences
diff.calc = function(dat1,dat2) {
  len = length(dat1)
  outmat = matrix(NA,len,13)
  #start and end columns of the 12 monthly fields
  st = 17 + (5*(0:11))
  en = st+4
  x1 = x2 = rep(NA,12)
  for (i in 1:len) {
    chx1 = dat1[i]
    chx2 = dat2[i]
    outmat[i,1] = as.numeric(substr(chx1,13,16))
    if (outmat[i,1] != as.numeric(substr(chx2,13,16))) return("Error")
    for (j in 1:12) {
      x1[j] = as.numeric(substr(chx1,st[j],en[j]))
      x2[j] = as.numeric(substr(chx2,st[j],en[j]))}
    #-9999 is the missing value code
    x1[x1==-9999]=NA
    x2[x2==-9999]=NA
    #values are stored in tenths of a degree
    outmat[i,2:13] = (x2-x1)/10}
  outmat}

#adjustment = adjusted - unadjusted
adjs = diff.calc(v2.meanx,v2.madjx)

#some statistics
12*422342 # 5068104 total number of monthly values
sum(is.na(adjs[,-1])) # 205985 (4.06%) NAs
sum( adjs[,-1]==0,na.rm=T) # 1631153 (32.18%) unadjusted values

#calculate annual average for each station in a given year
year=adjs[,1]
ann.mean = rowMeans(adjs[,2:13],na.rm=T)

#calculate average of all adjustments in a given year
annadj = data.frame(year,ann.mean)
aveadj = c(by(annadj,annadj$year, function(x) mean(x$ann.mean)))

plot(year,ann.mean,cex=.25,main = "Annual Averages for Individual Stations", xlab="Year", ylab="Degrees (C)" )
plot(as.numeric(names(aveadj)),aveadj, main = "Mean Annual GHCN Adjustment", xlab = "Year",ylab = "Degrees (C)")
[/sourcecode]
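As an aside on the pairing step, base R's `match()` gives the same kind of lookup without requiring the keys to be sorted. This is a sketch on made-up toy keys (`raw_ids` and `adj_ids` are hypothetical stand-ins for the 16-character station and year identifiers), not a replacement tested against the GHCN files. It assumes each key occurs at most once in the raw set, since `match()` returns only the first hit.

```r
# Toy identifiers standing in for the station+year keys (made up for illustration)
raw_ids <- c("A1990", "A1991", "B1990", "C1990", "C1991")
adj_ids <- c("A1991", "B1990", "B1995", "C1991")

# Position of each adjusted key within the raw keys; NA where no raw line exists
inds <- match(adj_ids, raw_ids)

# Logical indexing is used on purpose: x[-which(cond)] returns an empty vector
# when cond is never TRUE, whereas x[!cond] keeps everything.
keep <- !is.na(inds)
paired_adj <- adj_ids[keep]
paired_raw <- raw_ids[inds[keep]]

identical(paired_adj, paired_raw)
```

The logical index matters only in the case where every adjusted line has a raw counterpart; in the posted script there were 31 unmatched lines, so the `-which(...)` form behaved as intended there.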

  • 12/01/11--06:01: Comment on GHCN and Adjustment Trends by RomanM (chan 2237999)
  • P. Solar: Thanks for the information. Actually, this post is almost two years old and I have used the "sourcecode" tag in some later threads, e.g. <a href="http://statpad.wordpress.com/2010/03/29/will-the-real-rapid-city-please-stand-up/" rel="nofollow">here</a>. It does make it easier for the reader to copy the code, because a mouseover produces a floating window in the upper right portion of the text and this allows the reader to copy all of the code with a single click without having to select that code first. However, long scripts can be somewhat bulky and interfere with the "flow" of a post so that sometimes it might be preferable to put them in a separate file.

  • 12/01/11--07:20: Comment on GHCN and Adjustment Trends by KevinUK (chan 2237999)
  • PSolar, May I ask how you came across this thread? As RomanM says, it's almost two years old and is IMO a seminal thread, as it subsequently sparked off a lot of activity by the 'Blackboard crew' as I call them (zeke h, nick s, r broberg, moshpit, the ccc guys etc) to attempt to refute Roman's analysis here. Have you read this thread? If not, please do so. For example http://statpad.wordpress.com/2009/12/12/ghcn-and-adjustment-trends/#comment-195 and hopefully you'll agree that despite the fact that a further two years have expired the GHCN database is still a mess. BEST haven't improved the situation in any real way and in fact, if anything they've WORST it. Now whatever happened to Giorgio Gilestro? Most of the people contributing to this thread are still around and still post regularly on various blogs (particularly CSIRO Mannian hockey-stick apologist Prof. Nick Stokes BSc,MSc,PhD). GG is conspicuous by his absence. KevinUK

  • 12/04/11--10:15: Comment on 2010 Spring Arctic Sea Ice Extent by Ruhroh (chan 2237999)
  • Dear Sir; Am curious about your 'statistical opinion' of the method described by Briffa in Tranche II email 3436; http://www.ecowho.com/foia.php?file=3468.txt&search=score+of+1 "we are having trouble to express the real message of the reconstructions - being scientifically sound in representing uncertainty , while still getting the crux of the information across clearly. It is not right to ignore uncertainty, but expressing this merely in an arbitrary way (and as a total range as before) allows the uncertainty to swamp the magnitude of the changes through time . We have settled on this version (attached) of the Figure which we hoe you will agree gets the message over but with the rigor required for such an important document. We have added a box to show the "probability surface" for the most likely estimate of past temperatures based on all published data. By overlapping all reconstructions and giving a score of 2 to all areas within the 1 standard error range of the estimates for each reconstruction , and a score of 1 for the area between 1 and 2 standard errors, you build up a composite picture of the most likely or "concensus" path that temperatures took over the last 1200 years (note - now with a linear time axis). This still shows the outlier ranges , preserving all the information, but you see the central most likely area well , and the comparison of past and recent temperature levels is not as influenced by the outlier estimates. What do you think? We have experimented with different versions of the shading and this one shows up quite well - but we may have to use some all grey version as the background to the overlay of the model results." Probably it is a better use of your life force to consider that entry into the 'reconstruction derby' of which you hinted elsewhere. Best, RR

  • 12/27/11--10:41: Comment on GHCN and Adjustment Trends by Layman Lurker (chan 2237999)
  • Roman, I just left this comment at Lucia's citing your "Mean Annual GHCN Adjustments" graph from this post. It falls into the category of "things that make you say hmmm".

  • 12/27/11--10:45: Comment on GHCN and Adjustment Trends by Layman Lurker (chan 2237999)
  • http://rankexploits.com/musings/2011/climategate-investigation-tallblokegreg-laden-laframboise/#comment-87938

  • 01/29/12--07:20: Comment on EIV/TLS Regression – Why Use It? by Hu McCulloch (chan 2237999)
  • Roman -- I've now had a chance to look at TLS a little, and have even implemented a simple program in Matlab. Although you -- and much of the climate literature -- use "EIV" and "TLS" interchangeably, I view EIV (Errors in Variables) as a generic problem, and TLS (Total LS) as merely one proposed solution. Although TLS sounds good on paper, in practice it is almost never useful, since it requires knowing the relative variances of the two errors e and f. I was at first concerned that TLS might not even be consistent, since if one had scaled the variables so that e and f had equal variance, and then were to go back and re-estimate the variances of e and f from the TLS regression residuals, the ratio of the variances would equal the slope of the line, as shown in your figure 2 above, rather than unity. But in simulations I did with sample size 100, 1000, and 100,000, the estimate of the slope converged on the true value for several combinations of parameters, so this apparently is not a problem. Wikipedia and a helpful 1996 article by P. de Groen I found online do not mention the issue of consistency, so a proof of this could be a nice paper for someone (not me!). A far more useful solution to the EIV problem is what might be called Adjusted OLS: As is well known (see Wiki or Pindyck and Rubinfeld's text), the standard OLS estimator bOLS = cov(x,y)/var(x) has plim beta * var(x*)/var(x), where x = x* + f and var(x) = var(x*) + var(f). But this implies that the adjusted OLS estimator bADJ = bOLS * var(x)/var(x*) = cov(x,y)/(var(x) - var(f)) is consistent. The multivariate case is similar, I believe. For some reason, neither Wiki nor de Groen nor P&R suggest this obvious adjustment. In climate data, for example, it is common to use a temperature series like HadCRU as the explanatory variable (when properly using CCE to calibrate a proxy, for example). But CRU admits this has a measurement standard error of .10 (in 1850) to .025 (since 1950), or about .05 on average. 
Meanwhile, the series itself has risen by about 1 dC since it started, so that it has a variance of about 0.1. This makes bOLS low by a factor of about 0.975, so that bADJ = bOLS * 1.026. Not enough to worry about, as it turns out, but worth checking. (CRU may in fact be overstating the precision of its temperature indices, but that is another matter.) If var(x) turns out to be less than var(f), that just means that whatever variation there is in x is just noise, and one shouldn't even bother running the regression. Another use for Adjusted OLS is in 2 Stage LS IV estimation -- the first stage regression gives a noisy but exogenous proxy for the endogenous regressor(s), but it also gives an estimate of the variance of the noise. The second stage regression then can be boosted with bADJ to reduce its finite sample bias. (In IV, the first stage noise goes to zero as the sample size gets infinite, so that 2SLS is consistent despite the EIV bias. But still it's worth removing as much of the finite sample bias as possible with bADJ, particularly when the instruments are weak, as is often the case.) In the limit when var(e) = 0, so that there is measurement error in x but no regression error in y, the natural solution is Reverse Regression -- regress x on y and invert. This is the limit TLS yields, as shown in your third graph. But although bADJ has the same plim, it will give a different actual value. Wayne Fuller has a relatively recent book on Errors in Variables problems which I suspect may use bADJ, but I only took a cursory look at it some time ago. Someone (not me) should work out the theory of standard errors for the Adjusted OLS and TLS estimators.
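A quick way to see the Adjusted OLS idea at work is a simulation with a known measurement-error variance. This is only a sketch: the sample size, the true slope of 1.5 and the variances are arbitrary assumptions, and `var_f` is treated as known exactly, which would rarely be true in practice. The last lines redo the HadCRU arithmetic quoted in the comment (standard error about .05, series variance about 0.1).

```r
set.seed(42)
n     <- 5000
beta  <- 1.5                                 # true slope (assumed for the simulation)
var_f <- 0.5                                 # measurement-error variance, assumed known

xstar <- rnorm(n, sd = 1)                    # unobserved true regressor, var(x*) = 1
x     <- xstar + rnorm(n, sd = sqrt(var_f))  # observed regressor x = x* + f
y     <- beta * xstar + rnorm(n, sd = 0.5)   # response y = beta * x* + e

b_ols <- cov(x, y) / var(x)                  # attenuated towards zero
b_adj <- cov(x, y) / (var(x) - var_f)        # adjusted OLS as described above

c(ols = b_ols, adjusted = b_adj)             # OLS should sit near beta * 1/1.5

# Arithmetic from the comment: var(f) = 0.05^2, var(x) = 0.1
attenuation <- 1 - 0.05^2 / 0.1              # var(x*)/var(x)
boost       <- 1 / attenuation               # factor applied to bOLS
c(attenuation = attenuation, boost = boost)
```

With these numbers the attenuation works out to 0.975 and the boost to roughly 1.026, matching the figures in the comment.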

  • 01/29/12--10:43: Comment on EIV/TLS Regression – Why Use It? by Hu McCulloch (chan 2237999)
  • Here's my 3-line Matlab script to compute univariate TLS on pre-centered data that has been normalized so that the variances of e and f are equal:
[sourcecode]
% TLS0
% univariate TLS with pre-centered and pre-normalized data
% (zero in name indicates restricted problem)
% x, y ~ n x 1
% x = xstar + f, y = ystar + e
% ystar = beta * xstar, var(f) + var(e) = min.
% x and y have been pre-normalized so that var(e) = var(f)
% by Hu McCulloch 1/28/12, after P. de Groen, "An Introduction to
% Total Least Squares," Nieuw Archief voor Wiskunde, 1996, 14:237-53,
% www.freescience.info
function [beta] = TLS0(x, y)
B = [x y];
[U, S, V] = svd(B, 'econ'); % B*V = U*S, S is 2x2, V is 2x2
beta = -V(1,2)/V(2,2); % This correctly gives inf if V(2,2) = 0.
% xstar and ystar are in U somewhere, but I'm not certain how to
% recover them.
end
[/sourcecode]
It's not really necessary to define "B", but I've included it for clarity.
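On the open question of recovering xstar and ystar: for the equal-variance, pre-centered case, the TLS fitted points are the best rank-1 approximation of the data matrix, i.e. the first singular triplet alone. The sketch below is written in R rather than Matlab, on simulated data with assumed parameters (true slope 0.8, error sd 0.3), and is an illustration rather than anything taken from the thread.

```r
set.seed(7)
n <- 100
xstar_true <- rnorm(n)                       # simulated true values (assumption)
x <- xstar_true + rnorm(n, sd = 0.3)         # x = xstar + f
y <- 0.8 * xstar_true + rnorm(n, sd = 0.3)   # y = ystar + e, with var(e) = var(f)
x <- x - mean(x)                             # pre-centre, as the script requires
y <- y - mean(y)

s    <- svd(cbind(x, y))                     # B = U S V'
beta <- -s$v[1, 2] / s$v[2, 2]               # same slope formula as TLS0

# Rank-1 approximation: keep only the largest singular value and its vectors.
B_hat     <- s$d[1] * outer(s$u[, 1], s$v[, 1])
xstar_hat <- B_hat[, 1]                      # fitted xstar
ystar_hat <- B_hat[, 2]                      # fitted ystar, equal to beta * xstar_hat

# The residuals come from the second singular triplet alone, so they are
# perfectly collinear: xres = -beta * yres for every observation.
xres <- x - xstar_hat
yres <- y - ystar_hat
c(slope = beta, resid_cor = cor(xres, yres), sd_ratio = sd(xres) / sd(yres))
```

The fixed proportion between the two residual series is why their relative size is set by the fitted slope and carries no independent information about var(f) and var(e).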

  • 02/01/12--10:49: Comment on EIV/TLS Regression – Why Use It? by RomanM (chan 2237999)
  • Hu, your point on the difference in usage of the two terms EIV and TLS is well taken. I will start making that distinction in the future. I have looked at the situation further and did some more analysis which seemed to add further insight in regards to some of your other points: <blockquote>Although TLS sounds good on paper, in practice it is almost never useful, since it requires knowing the relative variances of the two errors e and f. </blockquote> I believe the problems with TLS go deeper than that. If the relative variances are known, TLS can be applied after weighting the SS components in the head post inversely to the variances of the e and f "errors" (thus reducing the problem to the equal variance case). However, the residuals for e and f are still perfectly linearly correlated although the ratio between them is changed to take the weighting into account. After reading your comments, I also tried a sequential <em>reweighted</em> LS approach as well. At each step, the variances of e and f were re-estimated and the regression repeated using the new weights. However, all of the cases I tried converged to the end result of a LS where the variable with the higher variance has its error term reduced to zero variability. I also looked very briefly at using maximum likelihood with normally distributed errors, but it wasn't clear that this would produce any better results. With regard to the consistency of the parameter estimates, my gut feeling would be that in most cases the results would be consistent although there might be demonstrable bias in the parameters (that converged to zero as the sample size increased). Where it could be bothersome is in the case that X and Y are uncorrelated with approximately equal variability. However, I am just guessing on this aspect and it could be wrong. The possible use of Adjusted OLS is intriguing, but I would need to do some reading to say more about that. 
Getting insight on the behavior of the estimates and their standard errors is made more difficult because the mechanics of the calculation involve portions of the principal components of the variables - in the multivariate case particularly. Something that might make it easier to comprehend is to do the calculations recursively and examine what happens in that process. I wrote a short program for this:
[sourcecode language="css"]
tls.alt = function(yvec,xvec,wts = c(1,1),tol=1E-6) {
  #wts = (y-variable weight, x-variable weight)
  # typically weights would be 1/ variable variance
  #variables do not have to be centered
  dif = tol+1
  ywt = wts[1]; xwt = wts[2]
  newx = xvec
  while (dif>tol) {
    lastx = newx
    regco = coef(lm(yvec~newx))
    newx = (ywt*regco[2]*(yvec-regco[1]) + xwt*xvec)/(xwt+ ywt*regco[2]^2)
    dif = max(abs(newx-lastx),na.rm=T)}
  yscore = regco[1]+regco[2]*newx
  xres = xvec-newx
  yres = yvec-yscore
  xss = sum(xres^2, na.rm=T)
  yss = sum(yres^2, na.rm=T)
  list(coefs=regco,yscore = yscore, xscore = newx, xresid =xres,yresid=yres,xss=xss,yss=yss)}
[/sourcecode]

  • 02/07/12--08:45: Comment on EIV/TLS Regression – Why Use It? by Hu McCulloch (chan 2237999)
  • <blockquote> After reading your comments, I also tried a sequential reweighted LS approach as well. At each step, the variances of e and f were re-estimated and the regression repeated using the new weights. </blockquote> How did you do this? Originally I was hoping to iteratively estimate the variance of e from the TLS residuals, starting from knowledge of var(f) and an initial guess of var(e) (say from OLS of y on x). But then I realized that TLS does not give back the true relative variance of f and e, unless the slope happens to be +/- 1 which would only be by accident. Although TLS is ML-motivated, it does not necessarily have the nice properties of ML. ML is consistent if a fixed number of parameters are being estimated with an increasing number of observations. But in TLS the number of unknowns increases 1-for-1 with the number of observations, so that x* and y* are not consistently estimated. It appears that as a consequence, the variance ratio is not consistently estimated by the residuals, even though the slope appears to be consistently estimated (from my limited simulations). Please e-mail me (and the gang) if and when you add to this interesting discussion, as it proceeds rather slowly.
